Foreword


Score reports have received short shrift in the measurement research literature and in
measurement practice, their design often perceived as secondary to traditional measurement
concerns, such as reliability and validity. In practice, score report design sometimes seemed like
an afterthought, attended to only after assessment development, score modeling, or other “more
pressing” issues. This attitude has always seemed shortsighted to me.


• If validity is a property of the inferences that score users make about tests and test results,
then should not the lens through which score users perceive the test results (the score
report and other test information) be a central concern of everyone in the assessment
field?

• Score reports are the public face of an assessment. When the public seeks information
about an assessment, score reports are the most visible, and they can influence public
opinion. With so much negative news about assessments and the testing industry in the
news of late, should not the academic, government, and corporate assessment world be
more careful about the design of score reports?


Because you have picked up this book, you likely agree that score reports are important. Perhaps
you create assessments, or otherwise work in the assessment field, so you feel a responsibility
to ensure that the person reading the report understands the information to be conveyed,
especially the lack of precision. After all, you are a testing professional and the people who are
reading the score report likely are not. They do not understand statistics, measurement, or what
goes into creating an assessment. Thus, clarity of statistical presentation and explanation of the
limits of drawing inferences based on point estimates (scores) should be most critical, right?

Wrong. Well, sort of wrong. Those aspects are important, but a score report should be more
than just a score, a comparison with others’ scores, and fine-print warnings for the reader (albeit
an unfair characterization of some score reports, but a fair description of many others). A score
report serves a purpose for the reader: to make a decision or to draw an inference about a test
taker or group of test takers. Design of score reports must be driven by that purpose and,
therefore, driven by the needs of the score user: the student who wants to learn, the parent who
wants to help their child learn, the teacher who wants to guide his or her class, the administrator
who wants to argue for more resources.

The best score reports are tailored to the needs of the score user, whether the intended
audience is students, parents, teachers, administrators, or other stakeholders. However, how does
one do that tailoring? As someone who has worked in the testing industry for almost 30 years,
I know that such design is tough. When I worked on a new assessment, people wanted to know
“what am I going to get from this assessment?” They were sometimes interested in seeing sample
tasks, hearing about the underlying construct, and learning about our validity research, but
they always insisted on seeing the score reports. We had two reports (one tailored to the test
taker and two for the institutions) that were created using some of the techniques (e.g.,
prospective score report, audience analysis, iterative design) discussed in the book you now hold.

I found this book to be an excellent exploration of the design challenge of tailoring score
reports. Especially useful is the advice on design processes and on understanding the test user,
which may be found in many chapters. Several chapters cover deliberate, systematic approaches
to the score report design process that puts the score user, the intended audience, front and
center. To design for a particular audience requires some knowledge of how people comprehend
and process information. Thus, some chapters cover how people perceive score reports, whether
accounting for how people comprehend and process visual inputs, or what inferences people
draw from different types of displays (and what incorrect inferences they are likely to draw).
Designing useful score reports for particular audiences can be tough, but the experts that editor
Diego Zapata-Rivera recruited to contribute should help anyone who wants to create better
score reports.

The contributors to this book represent some of the multi-disciplinarity that should go into
score report creation. Certainly, there are psychometricians, particularly measurement experts.
There are also cognitive psychologists, experts in validity theory, computer scientists, education
researchers, data scientists, and statisticians. It takes combinations of these disciplines to
understand how test users make meaning from a score report, how they draw inferences, and how
they make decisions. Building on this understanding, these different disciplines have much to
say about how to design tailored reports that ensure that the inferences and decisions test users
make are appropriate: that they reflect the strengths and limits of the information presented in
the score report.

Whether you are a researcher, practitioner, or both, Score Reporting Research and Applications
will provide you with much to think about regarding, and concrete recommendations for,
the design of score reports that help score users make appropriate, evidence-based decisions.


Irvin R. Katz 
Princeton, NJ
March 28, 2018 
Introduction
Why Is Score Reporting Relevant?


Diego Zapata-Rivera 


Research in the field of score reporting has increased considerably in the last decade. This area
of research is no longer limited to investigating the psychometric properties of scores and
subscores; instead, its scope has broadened to include aspects such as designing and evaluating
score reports taking into account the needs and other characteristics of particular
audiences; exploring appropriate use of assessment information; investigating how particular
graphical representations are understood by score report users; designing support materials to
facilitate understanding and appropriate use of score report information; designing and
evaluating interactive report systems; and foundational research on the cognitive affordances of
particular graphical representations.

Validity is a property of the interpretation and use of test results (AERA, APA, & NCME,
2014; Hattie, 2009; Kane, 2006). By studying how particular audiences understand the intended
messages conveyed by score report information, we can devise mechanisms for supporting valid
interpretation and use of score report information. With this book, we bring together a group of
researchers working in various areas relating to the field of score reporting, including
researchers designing and evaluating score reports in the K-12 and higher education contexts and
those doing foundational research on related areas that may inform how to communicate assessment
information to various audiences.

Papers in this volume build on a growing body of literature in the score reporting field that
includes work on frameworks for designing and evaluating score reports (e.g., Hambleton &
Zenisky, 2013; Zapata-Rivera & VanWinkle, 2010), approaches for tailoring score reports to
particular audiences (e.g., Jaeger, 2003; Zenisky & Hambleton, 2012; Wainer, 2014;
Zapata-Rivera & Katz, 2014), and evaluating score reports for teachers (Rankin, 2016; Zapata-Rivera,
Zwick, & Vezzu, 2016; Zwick, Sklar, Wakefield, Hamilton, Norman, & Folsom, 2008; Zwick,
Zapata-Rivera, & Hegarty, 2014), parents (Kannan, Zapata-Rivera, & Leibowitz, 2016;
Zapata-Rivera et al., 2014), students (Goodman & Hambleton, 2004; Vezzu, VanWinkle, &
Zapata-Rivera, 2011), and policy makers (Hambleton & Slater, 1997; Underwood, Zapata-Rivera, &
VanWinkle, 2010; Wainer, Hambleton, & Meara, 1999).
A Balance of Research and Practice


This volume is divided into two sections, providing a balance of research and practice. The
first section includes foundational work on validity issues related to the use and interpretation
of test scores; design principles drawn from related areas including cognitive science,
human-computer interaction, and information visualization; and research on presenting particular
types of assessment information (e.g., subscores, growth, and measurement error information)
to various audiences. The second section provides a select compilation of practical applications
of designing and evaluating score reports in real settings. In aggregate, the papers describe
current work on what assessment information to present and how to present it to particular
audiences for various purposes. Altogether, this volume provides interested readers with a unique
source of current research and applications in the area of score reporting.


Part 1. Foundational Work


This section includes five chapters. In Chapter 1, Tannenbaum focuses on validity aspects of
score reporting. The author makes an argument for clearly communicating score report
information, since the ability of the stakeholders to make appropriate decisions is dependent, in
part, on the relevance and accuracy of the assessment results, and the ability of the stakeholders
to understand the reported information in the way intended. This chapter highlights key concepts
and practices that are intended to support the validity of score reports. The author elaborates
on sources of validity evidence and their implications for score reporting, describes strategies
to build alignment between tests and score reports, and provides guidelines on practices for
developing score reports.

In chapter 2, Hegarty reviews current cognitive science research on how people understand
visualizations of quantitative information. She describes cognitive models of visualization
comprehension focusing on the roles of perception, attention, working memory, and prior
knowledge. Principles of data-visualization design that take into account the properties of the
displays and the individuals are discussed. Finally, the author discusses research on
comprehension of test scores and the implications for the design of test score representations for
different stakeholders.

In chapter 3, Sinharay, Puhan, Haberman, and Hambleton focus on the quality of subscores.
The authors discuss current methods to evaluate when subscores satisfy the professional
standards on the reliability, validity, and distinctness of subscores (AERA, APA, & NCME, 2014).
Finally, the authors discuss alternative approaches to subscores for the case in which reporting
of subscores is not warranted and provide general recommendations on communicating
subscores.

Chapter 4 by Zenisky, Keller, and Park addresses the issue of reporting student growth. The
authors describe current work in this area and elaborate on the role that this type of information
has recently been playing in informing high-stakes educational decisions. The authors present
results from a small-scale study to evaluate understanding of several common growth reporting
display strategies. Finally, the authors elaborate on implications of the results for reporting
student growth and the need for additional research in this area.

In chapter 5, Zapata-Rivera, Kannan, and Zwick review work on measurement error and
on communicating measurement error information to teachers and parents. They focus
on analyzing the processes of designing and evaluating score reports taking into account the
knowledge, needs, and attitudes of the target audience. They elaborate on the need to consider
teachers and parents as different audiences and examine the potential for using similar research
methods and materials when conducting research with these audiences. Finally, they provide
recommendations for this area that include the need for targeted research on (a) effectively
communicating information in score reports, (b) investigating both user preferences and user
comprehension, and (c) evaluating instructional tools to support understanding and appropriate
use of assessment information with particular audiences.


Part 2. Practical Applications 


This section covers the last five chapters. These chapters include score reporting work in various
contexts: large-scale assessment programs in K-12, credentialing and admissions tests in higher
education, using reports to support formative assessment in K-12, applying learning analytics
to provide teachers with class- and individual-level performance, and evaluating students’
interpretation of dashboard data.

In chapter 6, O’Donnell and Sireci review current research and practices in score reporting
in assessments for credentialing and admissions purposes. The authors review frameworks for
the development and evaluation of score reports and offer perspectives on validity issues regarding
the appropriate communication of assessment results so the intended purposes of the tests can
be fulfilled and any potential negative consequences can be minimized. The authors conclude
by discussing the importance of having the stakeholders’ needs in mind when designing score
reports and interpretive materials for admissions and credentialing programs.

Chapter 7 by Slater, Livingston, and Silver focuses on score reporting issues for large-scale
testing programs. The authors discuss the score report design process, which starts by
assembling an interdisciplinary team of experts. The steps of an iterative score report design
process are described. These steps include: gathering information about the test and the scores
to be reported; creating a schedule; creating report design prototypes or “mockups”; getting the
client’s reactions to the mockups and revising them accordingly; conducting usability testing or
focus groups to get reactions to the mockups from potential users of the report; revising the
mockups based on feedback from users; and making a final choice and getting approval from
the client. The authors emphasize the need for frequent communication between the client and
the design team. Finally, lessons learned on applying this process are provided.

In chapter 8, Brown, O’Leary, and Hattie describe various principles of effective report
design derived from decades of empirical and theoretical research. These principles emphasize
the utility of reports in terms of having a clear purpose and explicit guidance for interpretation
and subsequent action, and the clarity of design, guidance, displays, and language used in
the reports. An on-line teaching and learning system, the Assessment Tools for Teaching and
Learning system (asTTle), that has been deployed in New Zealand’s schools is used to illustrate
these design principles. The system allows teachers to create tests for their students and interact
with a suite of reports including group and individual-level reports. This system was designed
to support effective formative assessment practices by teachers.

In chapter 9, Feng, Krumm, and Grover explore the use of learning analytics to make sense
of learner performance data collected in various contexts and provide stakeholders with the
information they need to support their instructional decisions. Four case studies are presented.
The case studies vary in context, subjects, focal constructs, analytical approaches, format of
data collected, and student learning tasks. This chapter examines how the needs of practitioners
shaped the work, the processes undertaken to develop data products, and the ways in which
data products were ultimately used by stakeholders. The authors recognize the need to provide
teachers with training to help them understand and make good use of the information
provided by these systems. The authors suggest working directly with practitioners to reduce
the complexity of the process and better understand their needs. The authors recognize the
potential of learning analytics to support teachers in doing formative assessment.
Finally, in chapter 10, Corrin reviews the design and evaluation of dashboards. Dashboards
are being used to share assessment information and provide feedback to a variety of
stakeholders. The author examines how data are presented to students and the intended uses of
interactive dashboards. The author reviews the literature on score reporting and identifies insights
that can be used to inform the design and evaluation of dashboards. Two case studies of dashboard
use in higher education are presented, each profiling important design elements of educational
dashboards and the methods of evaluation adopted. The chapter concludes with a discussion of
issues for future research in this area.


Final Remarks 


These chapters provide a good account of the issues researchers in the area are currently
exploring. Frameworks for designing and evaluating score reports, guidelines, lessons learned, and
insights for future research are discussed. Both foundational research and practical applications
are covered. The following questions were provided to authors and served as general guidance
for writing their chapters:


• How are target audience characteristics such as knowledge, needs, and attitudes taken
into account when designing their score reports/report systems?

• How are their score reports/report systems intended to be used?

• What kinds of claims about student knowledge, skills, and abilities are made and what
data are used to support those claims?

• What kinds of support mechanisms are implemented in order to ensure appropriate use
of score report information by educational stakeholders?


The chapters highlight the importance of clearly communicating assessment results to the
intended audience to support appropriate decisions based on the original purposes of the
assessment. Support may take the form of a clean and simple graphical design that clearly answers
the main concerns of the audience, interpretive materials, on-line help, video tutorials, interactive
learning materials, and professional development. As more technology-rich, highly interactive
assessment systems become available, it becomes more important to keep in mind that the
information provided by these systems should support appropriate decision making by a variety of
stakeholders. Many opportunities for research and development involving the participation of
interdisciplinary groups of researchers and practitioners lie ahead in this exciting field.
Validity Aspects of Score Reporting


Richard J. Tannenbaum


Educational tests are intended to capture and provide information about students’ content
knowledge, skills, competencies, thought processes and response strategies, personalities,
interests, and so on. Of course, no one test captures all such valued information. In any case, some
form of a score report is provided to the information-users (e.g., teachers, counselors,
school-level and state-level administrators, parents, and students) so that they may act on the
reported scores and any accompanying information. The score report is the bridge between the
information captured by the test and the decisions or actions of the information-users (Zapata-Rivera &
Katz, 2014). A score report that is not well-aligned with the test is of little value; similarly, a
score report that is well-aligned, but not communicated to users in a way understandable to
them is of little value. Stakeholders cannot make reasonable decisions or take reasonable actions
from information that they do not satisfactorily understand, no matter how accurate that
information may be in reality.

Proper understanding and use of information is central to the concept of validity: “the degree
to which evidence and theory support the interpretations of test scores for proposed uses of
tests” (Standards for Educational and Psychological Testing, 2014, AERA, APA, NCME, p. 11).
Validity is about the extent to which inferences from test scores are appropriate, meaningful,
and useful (Hubley & Zumbo, 2011).

This chapter reinforces the growing recognition that score reports and the interpretation
of their meaning are part of the overall argument supporting test validity (MacIver,
Anderson, Costa, & Evers, 2014; O’Leary, Hattie, & Griffin, 2017). The criticality of score reports in
this regard was nicely summarized by Hambleton and Zenisky (2013): “Quite simply, reporting
scores in clear and meaningful ways to users is critical, and when score reporting is not handled
well, all of the other extensive efforts to ensure score reliability and validity are diminished”
(p. 479).

Subsequent sections of the chapter will introduce models and approaches that support
opportunities for better alignment between what the test is intended to measure and the score report,
how to build meaningful score reports, and validity-related questions and validation methods
to evaluate the effectiveness of score reports. Sources of validity evidence for test scores and how
these sources relate to score reports and their interpretation are highlighted first.

The Standards for Educational and Psychological Testing (Standards) (2014) discuss five
sources of validity evidence. Each source reflects a different focus, and depending on the
intended test score use, one or two sources may take priority. It is not the case that each and
every source must be applied to each and every score use. The source of validity evidence should
be aligned with the purpose of the test and offer backing for the intended use of the test scores.


Evidence Based on Test Content


This source addresses the extent to which the content of the test reflects the content domain
the test is intended to represent. Essentially, the notion is that a test is really a sample of a much
larger domain of knowledge, skills, abilities, judgments, and so on that is of interest. A one- or
two-hour test likely cannot cover everything of interest. Even a test battery as extensive as that
required to become a certified accountant, which includes four tests each covering a specific
domain, with each test being four hours long, still only represents a sample of what
accountants are expected to know and be able to do
(www.aicpa.org/BecomeACPA/CPAExam/Pages/default.aspx). It therefore becomes important that
there is evidence that a test is a reasonable reflection of the larger domain or domains of interest.
The transfer to score reports is that there should be evidence that the reported information is
aligned with the test content, and presented in a way that is understandable to the stakeholders.
The score report should not introduce test-irrelevant information, but should be a faithful
reflection of what the test measures and how the test taker(s) performed on the test, which may
include areas to improve upon and the identification of resources to assist in that regard.


Evidence Based on Response Processes


This source of evidence becomes more important when the test scores are intended to reflect the
strategies, approaches, or cognitive processes test takers are using to address the test items or
tasks. Evidence supporting intended response processes may come, for example, from students
verbally reporting out either how they are solving the task or how they solved the task (cognitive
laboratories or think-alouds, e.g., Leighton, 2017); from maintaining keystroke logs, for
example, to understand writing strategies (e.g., Leijten & Van Waes, 2013); and/or from eye-tracking
data to document and evaluate where on a task students are paying more and less attention (e.g.,
Keehner, Gorin, Feng, & Katz, 2016). When applied to score reporting, evidence should support
that the score-report users are attending to the more relevant or salient features of the report,
and interpreting that information as intended. How easily users locate reported information
and focus on that information is closely related to how well the score report is designed and
organized (Hambleton & Zenisky, 2013).


Evidence Based on Internal Structure 


One focus here is on evidence that substantiates the intended structure of the test. Structure in
this context means the number of dimensions or constructs the test was designed to address. For
example, if a test was intended to measure reading comprehension and reading fluency, there
should be evidence supporting this. Evidence of internal structure may also relate to assuring
that individual items on a test do not function differently by subgroups of test takers (e.g., boys
and girls, African American test takers and White test takers), or that the test scores, overall,
have the same meaning for subgroups of test takers (Sireci & Sukin, 2013). When applied to
score reporting, evidence should confirm that stakeholders recognize the intended relationship
among the information reported: for example, how a reported range of scores corresponds to
the width of a band on a figure in the report. Further, evidence should support that subgroups of
stakeholders understand the same reported information in the way intended. If, for example, the
same score report is intended to be shared with native and non-native speakers of English, the
reported information should be accessible to both groups. The report should not include
language that is unnecessarily complex and therefore less understandable by non-native speakers.
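
The link between a reported score range and the band drawn on a figure can be made concrete
with a small calculation. The sketch below is only an illustration, not a method described in this
chapter: it assumes the band is built from the classical standard error of measurement,
SEM = SD × sqrt(1 − reliability), and the function name `score_band`, its parameters, and the
sample values are all hypothetical.

```python
import math


def score_band(score, sd, reliability, z=1.0):
    """Return (low, high) limits of a band around an observed score.

    Assumption (not from the chapter): the band is score +/- z * SEM,
    where SEM = sd * sqrt(1 - reliability) is the classical standard
    error of measurement.
    """
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    sem = sd * math.sqrt(1.0 - reliability)
    return score - z * sem, score + z * sem


# Hypothetical values: scale SD of 100, reliability of 0.91, observed score 500.
low, high = score_band(score=500, sd=100, reliability=0.91)
print(f"Reported range: {low:.0f} to {high:.0f}")
```

With these illustrative values the SEM is about 30 score points, so the band runs from roughly
470 to 530. A report designer would still need evidence that users read the drawn band as that
range.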


Evidence Based on Relationships to Other Variables


This source of evidence is most applicable when test scores are expected to be related to another
measure or an outcome. One example of this is when a test is designed to predict the
likelihood that high school students may be successful in their first year of college (as measured by
their grade-point average). In the context of score reporting, evidence could take the form of
comparing how closely the level of students’ competency expressed on the score report is to
teachers’ evaluations of those students’ competencies. A convergence between the report and
teacher evaluation would be confirmatory evidence. A disparity between the two sources would
be more difficult to interpret, as it could reflect inconsistency between how the test has
operationalized the content and how the teacher defines and values that same content.
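
The comparison described above can be sketched numerically. The example below assumes, purely
for illustration, that reported competency levels and teacher ratings are coded on the same
ordinal scale and that a Pearson correlation is an acceptable summary of convergence. The
chapter does not prescribe a statistic, and the function name `pearson_r` and the data are
hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a sequence is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: competency level shown on each student's score report
# and the teacher's independent rating of the same student (both coded 1-4).
reported_levels = [1, 2, 2, 3, 4, 4]
teacher_ratings = [1, 2, 3, 3, 3, 4]

r = pearson_r(reported_levels, teacher_ratings)
print(f"Convergence (Pearson r): {r:.2f}")
```

A high value would count as confirmatory evidence in the sense used above. A low value would not,
by itself, say whether the test or the teacher's definition of the content is the source of the
disparity.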


Evidence for Validity and Consequences of Testing 


When a test score is used to inform a decision or to guide an action there will be consequences for
one or more stakeholders (Geisinger, 2011; Lane, Parke, & Stone, 1998). Some of the consequences
are intended and desired. A test, for example, that is intended to support student learning and
development, and in fact, does just that, is a positive intended consequence. However, sometimes
tests may have unintended, negative consequences (Standards, 2014). One well-known example
is in the context of K-12 state-wide accountability testing, where teachers may teach only to the
material covered by the test and not to the full curriculum (Koretz & Hamilton, 2006; Lane &
Stone, 2002). When applied to score reporting, evidence should indicate that stakeholders are
acting on the reported information in ways consistent with reasonable expectations, and not making
inaccurate interpretations leading to inappropriate decisions. Such evidence could be collected
from surveys of or interviews with stakeholders about how they actually used the reported
information, as well as what aspects of the reported information were most and least useful.


A Brief Recap 


It may be helpful at this point to recast the information above. Score reports are intended to
provide stakeholders with the information they need, in a way that they understand, so that
they may reasonably act on that information. There should be evidence that supports that the
reported information is interpreted and understood, as intended, and that the decisions and
actions based on that interpretation and understanding foster positive, intended consequences
and very few, if any, negative consequences. These are important validity-related goals of score
reporting. The next section describes general strategies for helping to assure these goals are met,
beginning with assuring the alignment between tests and score reports.


A Strategy to Build Alignment Between Tests and Score Reports


Yogi Berra, hall-of-fame catcher and philosopher, once remarked: “If you don’t know where
you are going, you might wind up someplace else”
(http://nypost.com/2015/09/23/35-of-yogi-berras-most-memorable-quotes). These are words of
wisdom when thinking about score reports. What do we want to conclude about test takers from
their performance on the test? Who will use this information? In addition, what are they expected
to do with that information? In other words, where do we want to wind up? If we cannot answer
these types of questions, it is not likely that the test, score report, and the actions taken based
upon the reported scores will be of much value. There should be a direct and transparent
alignment among the test purpose (what it is supposed to accomplish), the test content and
format, how responses to the test items or tasks are scored, and how summaries of those scored
responses are reported to stakeholders. It is through accurate alignment that there is a much
greater likelihood that the stakeholders will act on the reported information in a way that is
consistent with what the test was supposed to accomplish. This point is reinforced by the
International Test Commission (2014), which emphasizes that


the test development, scoring, and analysis stages should all take into consideration the
final product: the reported interpretation of the scores. In this sense, the underlying aim,
or tacit first step of the whole process of test development, is ensuring that the reported
score will be properly understood.

(p. 205) 


Stated somewhat differently, work backward from what is desired to be communicated in the
score report to assure that the test ultimately provides the desired information. This is, in fact,
the value of developing what is known as a Prospective Score Report (PSR; Zapata-Rivera,
Hansen, Shute, Underwood, & Bauer, 2007; Zieky, 2014).


Prospective Score Reports 


A PSR is a mock-up of what the final score report should include and look like. It lays out all
the specifications that are to be communicated. “The PSR should be used at the beginning of
assessment design, with the reporting measures continually being refined, as necessary, through
the remainder of the assessment design process” (Zapata-Rivera et al., 2007, p. 275). In other
words, the test and the score report should be developed more or less simultaneously. Once the
claims of the test and needed evidence are clearly formulated, the PSR should be developed;
and as the test development process moves ahead, implications for the score report need to be
documented and represented in changes to the PSR.

However, more often than may be desirable, score reports tend to be considered near the end
of the test-development process, perhaps because they occur at the end of the process. That is
neither effective nor efficient. Score reports have a spatial organization as well as technical
parameters and constraints. Deciding at the end of the test-development process, for example,
to have the score report provide exemplars of students’ responses at different score levels may
not be feasible if the related information-technology system has not made prior accommodations
for that detailed reporting. That kind of specificity should have been thought about at the
beginning of the test-development process, or sometime during the process, so that the
expectation could have been factored into the reporting system in a timely manner.

A failure to construct PSRs (waiting until the end to develop score reports) also runs the
risk of holding the score report “accountable” for reporting information that was not
satisfactorily collected by the test. This is all too common. One instance of this is where a test
was not designed to provide sufficient evidence at a subscore level (a score about a specific test
section or group of test items), but an influential stakeholder later expects meaningful subscores
to be reported. The expectations of the stakeholder should have been included at the very
beginning of the test-design stage, when claims were being formed about what to measure and
what was to be concluded about test takers. This way, the test could have been designed to
include enough items or tasks to generate the evidence needed to provide meaningful subscores.
It may also be the case that constraints on testing time and test fees would not permit this to
occur. This would be important to know sooner rather than later, so the expectations of
stakeholders could be moderated.


Guidelines and Practices for Developing Score Reports 


The prior section noted the importance of working backward, and the value of Prospective
Score Reports in that regard. This section discusses specific steps that one may follow to
construct score reports. First, it is important to note that there are some questions that must be
answered from the start, as they have implications for the test and score report; as such, they
may be thought of as design claims.


Questions to Consider 


One question is will the reported information be used for formative purposes or for summative
purposes? A formative use supports learning and development opportunities (Black & Wiliam,
2009). Where are students in need of more focused instruction? A formative-focused reporting
of scores, therefore, would include feedback to let the teacher and the students know what
content or skills are to be bolstered. Brown, O’Leary, and Hattie (this volume) provide further
discussion of score reports used for formative purposes. A summative use is more often associated
with outcomes: What have students learned, for example, at the end of the semester? Which
students on the state-mandated test are classified as Proficient? In these cases, the score report
tends not to include the kinds and level of feedback sufficient to support the current cohort of
students’ learning.

A second question is will the score report be static or dynamic? A static report is more or less
the traditional form of a report: there is one “view” of the information offered to stakeholders,
and the report may be paper-based or digital. However, there is no opportunity to customize
the report to meet varying needs. This does not mean that the information is necessarily lacking
or that the report is of low quality (Zenisky & Hambleton, 2016). If a static report will meet
stakeholder needs, then that is fine. On the other hand, dynamic reports allow stakeholders to
interact with the reporting system (to varying degrees) to customize the reported information.
Sometimes, the reporting system may include pre-determined views of test-score summaries
that can be selected (e.g., scores by different subgroups, line graphs versus pie charts), so the
system is partially interactive (Tannenbaum, Kannan, Leibowitz, Choi, & Papageorgiou, 2016). In
other instances, the stakeholders may be able to “request” new or additional analyses for
inclusion in the report. The most well-known system for reporting group-level scores is the NAEP
Data Explorer (www.nces.ed.gov/nationsreportcard/naepdata). Highly interactive systems are
not very common, as they are costly and somewhat difficult to develop. For more details about
dynamic score reports see Feng, Krumm, Grover, and D’Angelo (this volume).

A third question is will the score report include subscores? As several researchers have noted
(e.g., Haberman, 2008; Sinharay, 2010; Zenisky & Hambleton, 2016), stakeholders increasingly
expect to receive subscore information. This is mainly due to the belief that with subscore
information teachers and students will know where additional attention is needed; that is,
subscores serve a formative purpose. This is both logical and reasonable. That said, to have
confidence in the meaning of a subscore, there must be a sufficient number of items addressing
that area. If there are too few, the subscore is not likely to be an accurate reflection of a
student’s competency in that area; Sinharay (2010) suggests that 20 items may be needed.
Further, he notes that the information in one subscore needs to be sufficiently distinct from
other subscores to be meaningful. If different subscores are essentially providing redundant
information (are highly correlated with one another), there is little justification for reporting
them. Sinharay, Puhan, Hambleton, and Haberman (this volume) provide more information about
reporting subscores.
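
The two conditions just described (enough items per subscore, and subscores that are not
redundant with one another) can be illustrated with a minimal sketch. The 20-item figure is the
one attributed to Sinharay (2010) above; the 0.90 correlation cutoff, the function names, and the
data are hypothetical and serve only to show the kind of check a design team might run early on.
This is not the procedure used by any of the cited authors.

```python
from itertools import combinations
from math import sqrt

MIN_ITEMS = 20          # item count suggested by Sinharay (2010), as cited in the text
MAX_CORRELATION = 0.90  # hypothetical cutoff for "highly correlated"; not from the source


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def screen_subscores(item_counts, scores):
    """Return plain-language warnings for a proposed set of subscores.

    item_counts: {subscore name: number of items}
    scores:      {subscore name: list of test-taker scores, same order in each list}
    """
    warnings = []
    # Condition 1: each subscore needs enough items to be meaningful.
    for name, count in item_counts.items():
        if count < MIN_ITEMS:
            warnings.append(f"{name}: only {count} items (fewer than {MIN_ITEMS})")
    # Condition 2: subscores should not duplicate one another.
    for a, b in combinations(scores, 2):
        r = pearson(scores[a], scores[b])
        if r > MAX_CORRELATION:
            warnings.append(f"{a} and {b}: correlation {r:.2f} suggests redundancy")
    return warnings


# Fictitious data for five test takers.
item_counts = {"algebra": 24, "geometry": 8, "statistics": 22}
scores = {
    "algebra": [10, 14, 18, 20, 23],
    "geometry": [2, 5, 4, 7, 6],
    "statistics": [9, 13, 17, 19, 22],
}
for warning in screen_subscores(item_counts, scores):
    print(warning)
```

With these fictitious data the sketch flags the geometry subscore for having too few items and
flags algebra and statistics as redundant. A check of this kind is most useful while the
Prospective Score Report is still being drafted, when the test blueprint can still be changed.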

The last question to consider is how to communicate to stakeholders the imprecision
associated with test scores? Those of us in the field of measurement readily accept that test scores
are not perfectly reliable, and we can communicate with each other in terms of standard errors
of measurement, conditional standard errors of measurement, and confidence intervals
(Zenisky & Hambleton, 2016; Zwick, Zapata-Rivera, & Hegarty, 2014). However, communicating
uncertainty in scores to other stakeholders (e.g., teachers, students, parents, administrators,
politicians) in a way that they will understand is not easy. Yet, our professional guidelines
(Standards, 2014, Standard 6.10) expect that score reports will include explicit information about
the measurement error (imprecision) associated with reported test scores. Zwick et al. (2014)
illustrate the challenges with presenting this information to teachers and college students.
Zapata-Rivera, Zwick, & Kannan (this volume) provide detailed discussion of communicating
measurement error to parents as well as teachers.
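
As a concrete reference point for the terms used above, the following minimal sketch computes a
standard error of measurement from the classical test theory relation SEM = SD × √(1 −
reliability) and turns it into a score band. The scale values, the function names, and the
wording of the sentence shown to the reader are all hypothetical; how best to phrase such a
statement for non-specialists is exactly the open question the cited studies address.

```python
from math import sqrt


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must be between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


def score_band(observed_score, sem, z=1.96):
    """Band of observed_score +/- z * SEM (z = 1.96 gives roughly a 95% band)."""
    margin = z * sem
    return observed_score - margin, observed_score + margin


def describe(observed_score, score_sd, reliability):
    """Hypothetical plain-language sentence a report might place next to a score."""
    sem = standard_error_of_measurement(score_sd, reliability)
    low, high = score_band(observed_score, sem)
    return (f"Score: {observed_score}. Because no test is perfectly precise, "
            f"a range of {round(low)} to {round(high)} is a better summary.")


# Fictitious scale: standard deviation of 50 points and reliability of 0.91.
print(describe(520, score_sd=50, reliability=0.91))
```

The sketch uses a single SEM for every score; a conditional standard error of measurement would
instead vary with the score level, which this simplified example does not attempt to model.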


Steps to Follow to Develop Score Reports


Specific guidelines for developing and evaluating score reports are offered by Hambleton and
Zenisky (2013), the Standards (2014; e.g., Standards 6.10, 8.7, 8.8, and 9.8), Zapata-Rivera,
VanWinkle, and Zwick (2012), and Zenisky and Hambleton (2012). These sources, as may be
expected, share much in common; the Hambleton and Zenisky (2013) model serves as the basis
for the following discussion.

The first three steps in their seven-step model are likely best implemented concurrently,
as they collectively relate to defining the relevant stakeholders and their needs, and gathering
examples of existing score reports that may be useful to consider. These three “framing” steps
should occur early in the test conceptualization and design process to better assure that
alignment to the test purpose is engineered.


Step 1


This includes the explicit delineation of the purpose of the score report: what needs is the
reported information expected to meet? What is the reported information intended to provide?
A primary focus of this step is to assure that the score report content accurately reflects what the
test set out to accomplish. For example, if the test was intended to be formative, the score report
needs to reflect the qualities of feedback important to fulfill that use.


Step 2 


This step, as noted, is closely aligned with the first step, in that it focuses on identifying the
relevant stakeholders who will rely on the score report to make decisions or to take actions. In
fact, Hambleton and Zenisky (2013) note that key stakeholders should be consulted during Step 1
to flesh out the specific needs to be met by the score report. It is during these first two steps that
the differing needs of stakeholders will emerge, not only in terms of desired content, but also in
terms of delivery mode or level of interactivity expected.

Zapata-Rivera and Katz (2014) expand upon the need to consider stakeholders through their
recommendation of conducting an audience analysis. One aspect of the analysis is clarifying
audience needs, which is closely related to what was already discussed: identifying score-user
goals and what they intend to do with the reported scores. The second aspect addresses the
audience’s test-relevant knowledge and general test literacy. What does the audience know
about the test, about the test-taking population, and about general measurement principles?
Less test-savvy audiences, for example, may need more explanations and supporting
information to make proper score-based interpretations. As noted earlier, while measurement experts
may have little difficulty understanding standard errors, this statistic may be confusing to less
test-savvy audiences. The last aspect of audience analysis focuses on audience attitudes or
preconceived notions or biases about testing (consider the recent opt-out movement regarding
K-12 accountability testing; Bennett, 2016). This includes what the audience expects of test
takers, how much emphasis they will likely place on the reported information, and how much
time they perceive they have to consider and act upon the information.


Step 3 


The focus here is on reviewing existing samples of score reports (conducting a review of
published or otherwise retrievable examples) to see if some of the needs and functionalities revealed
during the first two steps have already been addressed. The goal here is to take advantage of
representations, or portions thereof, that may be useful and to avoid repeating others’ mistakes.


Step 4


This step involves pulling together the information collected from the prior steps to develop
one or more Prospective Score Reports (PSRs; Zieky, 2014). It is during this stage that care
must be taken not only to build prototypes that appear to meet stakeholder data needs and the
objectives of the test, but also to consider issues of design and presentation clarity, scaffolding
to support proper interpretation, and accessibility more generally. As Hambleton and Zenisky
(2013) note, this stage may require the involvement of many different expert groups, such as
content specialists, measurement experts, user interface experts, information technology
specialists, graphic designers, cognitive scientists, and others, depending on the intended form and
functionality of the reports. Given the role of score reports in the overall validity argument for
a test, the investments made here are worthwhile.


Steps 5 and 6 


Once PSRs are available, it is important to gather data about the extent to which the score
reports are communicating the information to stakeholders as intended (Step 5) and to revise
accordingly (Step 6). These steps should be considered iterative, as more than one round of
evaluation and revision will likely be needed. During the Step 5 evaluation, data should be
collected about stakeholders’ reactions to how the information is displayed: is the information
visually accessible? This would include questions, for example, about readability, preferences
for different data presentations, and aspects of the score report that may not be attended to as
expected. Data should also be collected about stakeholders’ understanding of the information.
Do stakeholders interpret the information as intended? Are they able, for example, to describe
accurately what the scores mean from the information displayed, or do they misinterpret
certain pieces of information?

Hambleton and Zenisky (2013) produced more than 30 questions that may be considered
as part of the development and evaluation of prospective score reports (as such, these
questions may inform Steps 4 through 6). The questions address eight key areas: Needs Assessment
(i.e., Does the score report reflect the expectations of stakeholders?); Content – Report and
Introduction and Description (e.g., Is the purpose of the test described?); Content – Scores and
Performance Levels (e.g., Is information about proper and improper use of score classifications
provided?); Content – Other Performance Indicators (e.g., If subscores are reported, is
information about imprecision also included?); Content – Other (e.g., Does the score report provide
contact information if questions arise?); Language (e.g., Is the report free of technical jargon
that may be confusing?); Design (e.g., Is the report clearly and logically laid out to facilitate
readability?); and Interpretive Guides and Ancillary Materials (e.g., If an interpretive guide is
provided, what evidence is there that it is understood by stakeholders?).

The International Test Commission (2001, 2014) also provides guidance regarding the
development and evaluation of score reports. This includes, for example, using a reporting structure
and format that is aligned with the test purpose; assuring that the technical and linguistic levels
of the reported information are appropriate for the score-report users; and providing sufficient
scaffolding to support proper interpretation. Hattie (2009) and Van der Kleij, Eggen, and
Engelen (2014) discuss other considerations when designing score reports.

Different methods may be used to collect the needed feedback (validity evidence).
Frequently used approaches include interviews (including think-aloud approaches), eye tracking,
focus groups, surveys, and more formal experimental designs (International Test Commission,
2014; Hambleton & Zenisky, 2013; Zenisky & Hambleton, 2016; Standards, 2014). Tannenbaum
et al. (2016), for example, provided different versions of mock score reports about English
learners’ English proficiency to a committee of teachers of English as a Second Language. During
the focus-group meeting the teachers were asked to consider what aspects of the mock reports
were more or less useful to inform possible instruction, and what new or additional
information would be helpful in that regard. Their recommended modifications were used to revise the
mock-ups.

Kannan, Zapata-Rivera, and Leibowitz (2016) successfully utilized one-on-one think-aloud
protocols with parents (of varying levels of English proficiency and educational backgrounds)
to gather evidence of their understanding of mock student-focused score reports. This approach
enabled Kannan et al. to better understand those aspects of the reported information (e.g.,
errors) that were especially problematic for parents who are themselves English learners or
who have comparatively lower levels of education (e.g., high school or some post-secondary
education).

Zwick et al. (2014) conducted two experimental studies whereby teachers and then college
students were randomly assigned to one of four conditions, with each condition depicting
different ways of reporting score imprecision. In each study, the participants completed
comprehension and display-preference questionnaires, as well as a background questionnaire to document,
in part, participants’ prior knowledge of measurement and statistics. The experimental design
structure enabled Zwick et al. to uncover misconceptions about interpreting error and
confidence bands, as well as to understand the relationship between familiarity with measurement
and statistics and the ability to comprehend more technically sophisticated score reports.


Step 7 


This final step addresses the importance of monitoring the usefulness of the score report once
the test becomes operational. Questions to consider here include, for example: What decisions
are stakeholders actually making from the test scores? How are the test scores being used in
conjunction with other information? What reported score information is most or least helpful?
What information might be important to add to the report in a future version? It is also
desirable at this stage to try to gather evidence of any unintended, negative consequences resulting
from the interpretation and use of the reported scores. Unintended consequences may arise
from misunderstanding the meaning of the scores, including attributing more meaning to the
scores than is justified. The results from this step may suggest the need to make revisions to the
report structure, content, or supporting materials to foster proper interpretation and use.


Conclusions 


No matter how well a test is conceptualized, designed, and implemented, if the scores reported
are not readily understandable to stakeholders, all the prior hard work and effort may have been
in vain. A correct understanding and interpretation of score reports is a prerequisite for
stakeholders to make reasonable decisions (Van der Kleij et al., 2014). “The interpretability of score
reports . . . is of the utmost significance and is fundamental in claims about validity” (O’Leary
et al., 2017, p. 21).

Evidence of validity is one of the central tenets of quality testing. However, as Hattie (2014)
noted, too little attention has traditionally been devoted to including score reports in the overall
validity argument of testing, although that is changing, as this chapter, in part, has strived to
highlight. There are two key messages from this chapter: First, score reports are, in fact, integral to
assuring the validity and utility of test score interpretation and use; and second, score reports
must be conceptualized and developed as early in the test-development process as possible to
better assure alignment to the test purpose and testing objectives, to better assure we end up
where we had intended, and not someplace else.


Advances in Cognitive Science and
Information Visualization


Mary Hegarty 


Information visualizations are important forms of human communication. They are used to
convey many types of data, including the results of scientific experiments (Belia, Fidler, Williams, &
Cumming, 2005), risks of disease (Garcia-Retamero & Cokely, 2013), weather forecasts (Lowe,
1996), and test reports (Goodman & Hambleton, 2004) to a variety of stakeholders
including domain experts, policy analysts, and the general public. This chapter provides an overview
of current cognitive science research on how people understand visualizations of quantitative
information. In this chapter I first review theories of how visualizations “augment cognition”
and describe cognitive models of comprehension of data graphs, including the contribution of
perception, attention, working memory, and prior knowledge to comprehension. Next, I
discuss how different visualizations of the same data sometimes convey different messages and
how individuals with various levels of expertise and prior knowledge sometimes interpret
displays differently. This research will be used to argue for principles for the design of effective
visualizations that have emerged from both theory and empirical studies of comprehension of
information visualizations, taking into account the fact that different displays may be more and
less effective for different uses and for different individuals. Finally, I review recent research on
comprehension of test scores in the light of cognitive science theories and empirical research,
and derive implications for the design of test scores for different stakeholders such as parents,
teachers, and educational administrators.

The broad category of “information visualizations” includes diagrams, maps, and graphs,
that is, external visuo-spatial displays that can represent objects, events, numerical data, or
more abstract information. Visualizations can be categorized based on the relation between the
representation and what it represents, and how space is used to convey meaning. One category
of visual displays consists of iconic displays such as pictures, maps, and drawings. In iconic
displays, space on the page represents space in the world and the properties displayed (shape, color,
etc.) are also visible properties of what is represented (e.g., the curve of a road on a road map,
the color of blood in a diagram of the human heart). Displays in the second category, relational
displays, are metaphorical in that they represent entities that do not have spatial extent or visible
properties (e.g., when an organization chart shows the hierarchy of positions in a business, or a graph
shows the price of stocks over time). In these displays, visual and spatial properties represent
entities and properties that are not necessarily visible or distributed over space. Visual-spatial
variables, such as color, shape, and location, can be used to represent any category or quantity.
The term information visualization is usually used to refer to this type of display (Card,
Mackinlay, & Shneiderman, 1999). In contrast to iconic displays, which can be traced back to ancient
cave drawings, these types of displays are a relatively recent invention. Specifically, the invention
of the data graph is attributed to Playfair in the 18th century (Wainer, 2005).


Why Visualize? The Advantages of Visualizing Data


Information visualizations are often said to enhance or “augment” cognition (Card et al., 1999;
Larkin & Simon, 1987; Scaife & Rogers, 1996). Cognitive scientists and other specialists have
proposed a number of ways in which presenting data graphically can enable people to
understand or reason about the data. First, information visualizations store information externally,
freeing up working memory for other aspects of thinking (Card et al., 1999; Scaife &
Rogers, 1996). Second, they can organize information by indexing it spatially, reducing search and
facilitating the integration of related information (Larkin & Simon, 1987; Wickens & Carswell,
1995). Graphs organize entities by placing them in a space defined by the x and y axes. As a
result, similar entities are visualized as close together. For example, in Figure 2.1, which shows
a scatter plot relating a (fictional) sample of children’s test scores to their parents’ income, the
dots representing children with similar levels of test scores and parental income are located
close together in the display. Graphs can also allow the offloading of cognitive processes onto
automatic perceptual processes (Scaife & Rogers, 1996). When non-visual data are mapped
onto visual variables, patterns often emerge that were not explicitly built in, but which are easily
picked up by the visual system, for example, when a line in a graph reveals a linear relationship
between variables (Shah, Freedman, & Vekiri, 2005). Graphs can thus enable complex
computations to be replaced by simple pattern recognition processes.

Figure 2.1 Scatter plot showing the relationship of SAT Math scores to family income for a fictitious class
of 30 students.


Cognitive Processes in Using Visual-Spatial Displays 


Although visual-spatial displays can enhance thinking in many ways, this does not mean that
their use is necessarily easy or transparent. Figure 2.2 presents a model of comprehension of
information visualizations, which is adapted from other cognitive models of how people
understand visuo-spatial displays such as graphs (Carpenter & Shah, 1998; Hegarty, 2011; Kriz &
Hegarty, 2007; Pinker, 1990). According to these models, graph comprehension involves a
complex interplay between display-driven (bottom-up) and knowledge-driven (top-down)
processes. First, the visual system senses the basic visual features of the display, such as color and
shape, and encodes these features to construct an internal representation of the visualization
itself. In complex visualizations, not all features are necessarily encoded, and which features are
encoded depends on attention, which might be directed by the viewer’s goals and expectations
(top-down processing) or what is salient in the display (bottom-up processing). For example,
one difficulty in display comprehension might arise if the viewer is distracted by highly salient
but task-irrelevant information such as a picture in the background of a graph, so that the
viewer fails to encode the critical information, although it is presented.

In addition to basic perceptual, attentional, and encoding processes, which construct a visual
representation of the display, the user of an information visualization typically has to apply
knowledge to construct a meaningful representation of the information presented in the display.
This can include knowledge of the conventions of the display, for example, that the independent
variable in an experiment is typically represented on the x axis and the dependent variable is
represented on the y axis of a data graph (Gattis & Holyoak, 1996), or the meaning of error bars
in a graph (Cumming & Finch, 2005). This type of knowledge is often referred to as a graph
schema (Pinker, 1990; Ratwani & Trafton, 2008). Understanding a graphic can also include
making further inferences based on domain knowledge (for example, understanding whether a
student’s test score is in the normal range for his or her grade level) or visual-spatial processes
(e.g., detection of a linear increase in test scores over time) so that the resulting internal
representation comes to contain information that is not presented explicitly in the external display.

Figure 2.2 Schematic overview of the different representations (indicated by boxes) and processes
(indicated by arrows) involved in understanding an information visualization.


All Visualizations Are Not Equal


A prominent conclusion from research on comprehension of graphs and other visual displays is
that there is no such thing as a “best” visualization of a given data set, independent of the task
to be carried out with this display or independent of the user of the display. First, visual
displays are used for many different purposes such as recording and storing information, serving
as computational aids, data exploration, and conveying information to various stakeholders.
Visualizations that are effective for one purpose (e.g., data exploration) might not be effective
for another (e.g., communication to the general public). Second, the way in which
information is displayed graphically can have powerful effects on how it is interpreted and processed,
providing evidence for bottom-up influences of display design. People interpret the same data
differently, depending on whether they are presented in pie charts or bar graphs (Cleveland &
McGill, 1984; Simkin & Hastie, 1987), bar graphs or line graphs (Shah, Mayer, & Hegarty, 1999),
and which variables are assigned to the x and y axes (Gattis & Holyoak, 1996; Peebles & Cheng,
2003; Shah & Carpenter, 1995). Some of these effects can be traced to the Gestalt principles of
perceptual organization, which determine which elements of displays are grouped, and can be
compatible or incompatible with the tasks to be carried out with a display. For example, line
graphs facilitate comparisons for the variable plotted on the x axis (time in Figure 2.3a) because
the lines group data points as a function of this variable, reflecting the Gestalt principle of good
continuation (Shah & Freedman, 2011). In contrast, bar graphs facilitate comparisons between
the variables shown in the legend (GRE sub-score in Figure 2.3b), because the bars comparing
data points with respect to this variable are closer, reflecting the Gestalt principle of proximity.
Thus, in Figure 2.3a it is relatively easy to see that the verbal sub-score increased over the period
shown or that the writing sub-score was more variable over time. In contrast, Figure 2.3b seems
to emphasize the differences between the three GRE scores, for example that for this
department, the quantitative scores are highest, and the writing scores are lowest.

Displays that are effective for one task may be ineffective for another. For example, tables
are better than graphs for communicating specific values whereas graphs are better than tables
for conveying trends in data (Gillan, Wickens, Hollands, & Carswell, 1998). Pie charts are an
interesting case in point. During the 20th century, statisticians developed a strong bias against
the pie chart, preferring divided bars to display proportions. However, careful experiments
indicated that for some tasks, pie charts are as effective as divided bar charts, and for other tasks
they are actually more effective (Simkin & Hastie, 1987; Spence & Lewandowsky, 1991). Simple
judgments (e.g., comparing the proportions of two entities) were slightly more effective with bar
graphs, but complex comparisons (e.g., comparing combinations of entities) were more efficient
with pie charts (Spence & Lewandowsky, 1991). For example, in Figure 2.4a it is easier to see
that about half of the class got either A or B grades whereas in Figure 2.4b it is easier to see that
the proportion of B and C grades was approximately equal.

An important current issue in information visualization is how to best represent data
uncertainty in graphical displays (Kinkeldey, MacEachren, & Schiewe, 2017; Spiegelhalter, Pearson, &
Short, 2011). A common method of presenting information about uncertainty is to use error bars
to show confidence intervals, but error bars are often misunderstood, even by researchers who
use them in interpreting their own data (Belia, Fidler, Williams, & Cumming, 2005). More novice
participants also have misconceptions about error bars, such as the assumption that the estimate
is equally likely to be anywhere within the error bars and not at all likely to be outside the error
bars (e.g., Correll & Gleicher, 2014; Zwick, Zapata-Rivera, & Hegarty, 2014). Because of
misconceptions in interpreting error bars, researchers have advocated alternative forms of uncertainty
visualizations, including violin plots and faded representations that show graded probability of
estimates with more distance from the mean (Correll & Gleicher, 2014; Cumming, 2007).

Figure 2.3 Line and bar graphs showing mean percentile scores of students admitted to a fictitious science
department over an eight-year period. The data displayed in the two graphs are identical and show mean
percentile scores over time for the verbal, mathematics, and writing subtests of the GRE.

Figure 2.4 Pie and bar charts showing distribution of grades in a fictitious college course.


All Consumers of Visualizations Are Not Equal


Comprehension of graphical displays is also influenced by knowledge. Experts and novices
attend to different aspects of visual displays and extract different information from these
displays. These top-down effects of knowledge on graphics comprehension can be separated into
effects of knowledge or facility with graphic conventions, knowledge of mathematics and
statistics (numeracy), and knowledge of the domain or topic of the graph data.


Knowledge of Specific Graphic Conventions


First, understanding the graphic conventions of a display involves understanding how the
display conveys information, for example the meaning of the axes and which visual variables
(color, shading, fill patterns, etc.) represent each aspect of the data. While some of this
information is often included in a figure caption or legend accompanying a graph, more basic
information is often assumed. For example, Kozhevnikov and colleagues (Kozhevnikov,
Hegarty, & Mayer, 2002) showed simple graphs of motion to undergraduate students who
had not taken any physics classes and had been classified as “high-spatial visualizers” and
“low-spatial visualizers.” The high-spatial visualizers correctly interpreted the graphs. For
example, when shown the graph in Figure 2.5, one student described the display as follows:
“At the first interval of time the position is the same: it cannot move. It has a constant velocity
at the second interval. It is moving constantly at a constant speed.” In contrast, some
low-spatial visualizers erroneously interpreted the displays; for example, one student interpreted
the same graph as follows: “The car goes constantly and then goes downhill. . . . It does not
change its direction. It goes downhill. This is a hill.” In this example, low-spatial visualizers
were subject to the “graph-as-picture” misconception (McDermott, Rosenquist, & van Zee,
1987). Because they did not have a schema for this type of graph, they erroneously
interpreted it as a picture.

Another case in point is the cone of uncertainty used to display hurricane forecasts (see
example in Figure 2.6). The cone of uncertainty is a forecast display produced by the National
Hurricane Center that indicates the current location of a hurricane storm, the storm's projected
path (track) over the next three days, and a cone shape surrounding the track line. A basic
convention of this graphic is that the width of the cone at any point in time represents the amount
of uncertainty in the forecasted location of the storm at that time point (specifically the 67%
confidence interval). Expert meteorologists and emergency management personnel know these
conventions. However, the uncertainty cone is often presented in news media without an
explanation of how it is created, or what is represented by the cone of uncertainty. Broad, Leiserowitz,
Weinkle, and Steketee (2007) reported that people in hurricane-affected regions of the US hold
misconceptions about what is represented by this visualization. One misconception was that the
cone shows the hurricane getting larger over time. Another was that the hurricane was unlikely
to travel outside the region depicted by the cone. Evidence of these misconceptions was also
found in a recent laboratory study in which students had to respond to true-false statements
about the meaning of the display (Ruginski et al., 2016). For example, 69% who viewed the
visualization in Figure 2.6 endorsed a statement indicating that the display shows the hurricane
getting larger over time, and 49% endorsed a statement that the damage was not likely to extend
beyond the cone. A more recent study indicated that people were less likely to endorse these
misconceptions if they first read a description of the display conventions (Boone, Gunalp, &
Hegarty, in press).

Figure 2.5 Graph of position as a function of time, used by Kozhevnikov et al., 2002.

Figure 2.6 Example of a hurricane forecast showing the cone of uncertainty.


Mathematical Knowledge


Comprehension of information visualizations can also depend on numeracy, that is,
quantitative or mathematical literacy. An important recent application of information visualization is in
communicating medical risks to the general public (Ancker, Senathirajah, Kukafka, & Starren,
2006; Garcia-Retamero & Cokely, 2013). To make medical decisions, patients and their doctors
often have to understand the risks of disease, unhealthy behaviors (e.g., smoking), and both
the benefits and possible side-effect risks of various medical treatments. However, members
of the general public have poor understanding of basic numerical and probabilistic concepts
necessary to understand risks (Peters, 2012). Researchers in medical decision making have had
good success in designing visual aids for communicating medical risks to the general public.
For example, they have found that bar charts are effective for comparing magnitude of risk
for different groups (e.g., nationalities), line graphs are effective for showing trends over time
such as survival curves, and icon displays are good for reducing denominator neglect, that is,
the tendency to focus only on the number of people affected by a disease (numerator) and
ignore the number of people who could potentially have been affected (denominator) (Lipkus,
2007). However, these visual aids are not equally effective for all individuals. Researchers in
medical decision making have developed measures of basic numeracy related to medical risks
(Cokely, Galesic, Schulz, Ghazal, & Garcia-Retamero, 2012) and graphicacy, or the ability to
interpret common graph formats such as bar graphs and icon displays (Galesic &
Garcia-Retamero, 2011). They found that people with high numeracy often have good understanding of
medical risks, regardless of whether the data are presented numerically or graphically, while
graphic displays are more effective than numerical descriptions for individuals with low
numeracy. However, graphic displays were not effective for all low-numerate individuals. Specifically,
individuals with poor numeracy but relatively good graphicacy had good comprehension of
visual aids for medical decision making, whereas the visual aids were ineffective for individuals
with poor graphicacy and numeracy (Galesic & Garcia-Retamero, 2011).


Domain Knowledge


Content or domain knowledge about the topic of a graphic can also affect its interpretation.
Lowe (1996) conducted a series of studies in which expert and novice meteorologists had to
interpret weather maps. Although both groups were familiar with the graphical conventions,
they differed in their interpretations. For example, the experts related features of the maps that
were causally associated, whereas the novices related features that were visually similar or close
together. Moreover, the experts often made inferences including predictions of how the weather
would change in the future, whereas the novices focused on describing the current weather
represented by the map.

Finally, top-down effects of knowledge can interact with bottom-up effects of display design,
and these interactions can affect both where people look on the maps and how they interpret the
maps. In experimental studies, college students were given the task of interpreting weather maps
to predict wind direction in a region of the map, given information about pressure (Canham &
Hegarty, 2010; Fabrikant, Rebich-Hespanha, & Hegarty, 2010; Hegarty, Canham, & Fabrikant,
2010). Bottom-up effects of display design were investigated by manipulating the number of
displayed variables on the maps, or the visual salience of task-relevant vs. task-irrelevant
information. Top-down effects of domain knowledge were investigated by examining performance
and eye fixations before and after participants learned relevant meteorological principles. Map
design and knowledge interacted such that salience had no effect on performance before
participants learned the meteorological principles, but after learning, participants were more accurate
if they viewed maps that made task-relevant information more visually salient.


Principles of Effective Visualization 


Based on both information processing theories and empirical research on comprehension of
visualizations, Hegarty (2011) summarized a set of principles for the design of effective
visualizations. These are best considered as heuristics for the design of displays. Many of them
have been documented by Kosslyn (1989, 1994, 2006) in a series of articles and books about
graph design and by Gillan, Wickens, Hollands, and Carswell (1998). As noted earlier, a general
meta-principle is that there is no such thing as a “best” visualization, independent of the task to be
carried out with this display and the user of the display. In addition, Hegarty summarized the
following sets of principles:


Principles Related to the Expressiveness of Displays 


One set of principles refers to how much information should be included in a visualization.
A general principle, referred to as the relevance principle by Kosslyn (2006), states that
visualizations should present no more and no less information than is needed by the user. Presenting all of
the relevant information in the display relieves the user of the need to maintain a detailed
representation of this information in working memory, whereas presenting too much information in
the display leads to visual clutter or distraction by irrelevant information (Wickens & Carswell,
1995). A related principle, the principle of capacity limitations, points out that graphics should be
designed to take account of limitations in working memory and attention. The relevance
principle is related to the idea of data-ink ratio, proposed by statistician and information design
pioneer Edward Tufte (2001). Tufte advocated deleting all non-data ink and all redundant data ink
“within reason” (p. 96), including deleting background pictures that are often included in
newspaper graphics, deleting the lines and tick marks on the axes of graphs, and deleting the bars and
filler patterns in bar graphs (which he referred to as redundant coding). Gillan and Richman
(1994) provided empirical evidence that increasing the data-ink ratio in graphs improved
accuracy and decreased response time, but their research also indicated that Tufte's principle was too
simplistic. For example, background pictures were generally disruptive, including the x and y
axes was generally beneficial, and the effects of redundant coding in displaying the data (e.g., fill
patterns for bars in bar graphs) were inconsistent and depended on the task and type of graph.


Principles Related to the Perception of Displays 


To be effective, a visual display has to be accurately perceived. Tversky, Morrison, and
Betrancourt (2002) refer to this as the apprehension principle of visual displays. For example, Kosslyn
(2006) draws on basic research in psychophysics to make the point that visual forms indicating
a difference between two variables need to differ by a large enough amount to be perceived as
different. This is related to the general principle that one should use visual dimensions that are
accurately perceived (Cleveland & McGill, 1984) and avoid using visual variables that lead to
biased judgments (Wickens & Hollands, 2000). For example, comparing the position of two
entities along a common scale is easier than comparing the position of entities on identical but
non-aligned scales, and perception of length is more accurate than perception of area
(Cleveland & McGill, 1984).

Finally, another important principle is what Kosslyn (2006) refers to as the principle of
perceptual organization, that people automatically group elements of displays into units. This principle
is based on the Gestalt principles of perceptual organization, which determine which elements
of displays are grouped. These groupings can be compatible or incompatible with the tasks to be
carried out with a display. For example, as discussed earlier, line graphs facilitate comparisons
between the units plotted on the x axis (see Figure 2.3a) because the lines group data points as
a function of this variable, reflecting the Gestalt principle of good continuation (Shah &
Freedman, 2011).


Principles Related to the Semantics of Displays 


A third set of principles refers to the semantics of visual displays. A visualization is easier to
understand if its form is consistent with its meaning (the compatibility principle; Kosslyn,
2006). For example, the use of visual variables to convey meaning needs to be consistent with
common spatial metaphors in our culture, such as up is good, down is bad, and larger graphical
elements represent more of something. Other common assumptions include that lines indicate
connections, circles indicate cyclic processes, and the horizontal dimension is naturally mapped
to time (Tversky, 2011).

An important principle emphasized by several theorists (Bertin, 1983; Zhang, 1996;
Mackinlay, 1986) is matching the dimensions of the visual variables with the underlying variables that
they represent in terms of scales of measurement. Both representing and represented
dimensions can vary in scale from categorical to ordinal, interval, or ratio (Stevens, 1946). For
example, shape is a categorical variable, shading is an ordinal dimension, orientation is an interval
dimension, and length is a ratio dimension. Zhang (1996) proposed that representations are
most accurate and efficient when the scale of the representing variable corresponds to the scale
of the represented variable. Efficiency refers to the fact that the relevant information can be
perceived because it is represented in the external representation. For example, the use of error bars
to display measurement error violates this principle because error bars are discrete (categorical)
representations of a continuous function (Cumming & Finch, 2005). The use of error bars
therefore can drive the interpretation that values of a variable can only fall within the error bars and
are equally likely anywhere within the error bars. The match, in terms of scales of measurement,
between the representing and represented variables is a central principle coded by Mackinlay
(1986) in a system that automated the design of relational graphics.
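
The matching principle can be illustrated with a small lookup. This is a minimal sketch and not a description of Mackinlay's (1986) system: the scale assignments follow the examples given in the paragraph above, while the names `VISUAL_SCALE` and `scales_match` are hypothetical and introduced only for illustration.

```python
# Scales of measurement (Stevens, 1946).
SCALES = ("categorical", "ordinal", "interval", "ratio")

# Scale carried by each visual variable, following the examples in the text.
VISUAL_SCALE = {
    "shape": "categorical",
    "shading": "ordinal",
    "orientation": "interval",
    "length": "ratio",
}


def scales_match(visual_variable: str, data_scale: str) -> bool:
    """Return True when the visual variable has the same scale as the data."""
    if data_scale not in SCALES:
        raise ValueError(f"unknown scale: {data_scale}")
    return VISUAL_SCALE[visual_variable] == data_scale


# Length (ratio) suits a ratio variable such as percent correct.
print(scales_match("length", "ratio"))
# Shape (categorical) does not suit a ratio variable.
print(scales_match("shape", "ratio"))
```

A strict equality test is the simplest reading of the principle; a fuller treatment would also have to say what happens when the visual variable carries more or less structure than the data.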


Principles Related to Pragmatics and Usability 


A final set of principles relates to the pragmatics and usability of visualizations. Pragmatics
refers to the broader context in which visual displays communicate and their rhetorical
function. One general pragmatic principle, the principle of salience, states that displays should be
designed to make the most important thematic information salient (Bertin, 1983; Dent, 1999;
Kosslyn, 2006). A related principle, the principle of informative changes (Kosslyn, 2006), is
that people expect changes across properties of a display to carry information. More broadly,
the ways in which information is visually displayed can subtly communicate information. For
example, people have more confidence in data that are presented in realistic displays, although
this is not necessarily warranted (see Smallman & St. John, 2005; Wainer, Hambleton, & Meara,
1999). Moreover, the proportion of space taken up by graphs in journal articles varies across the
sciences, with more space devoted to graphs in disciplines rated as “hard sciences” (Cleveland,
1984; Smith, Best, Stubbs, Archibald, & Roberson-Nay, 2002). How data are displayed might
therefore give an impression of their reliability or scientific nature.

Usability of a visualization refers to ensuring that the viewer has the necessary knowledge
to extract and interpret the information in the display. Visual displays are based on graphic
conventions, and users need to know the conventions of a particular graphic form in order
to comprehend it. Kosslyn (2006) refers to this as the principle of appropriate knowledge. This
knowledge is often thought of as being part of the graph schema (Pinker, 1990; Ratwani &
Trafton, 2008). The conventions of a visual display are often provided in a legend, and in cartography
a legend is considered to be an obligatory component of every map (Dent, 1999). While data
graphs often include legends, for example stating which colors, shading, etc. refer to which
variables in a bar or pie chart, their comprehension often depends on more basic assumptions that
the user is expected to have. For example, an understanding of measurement error is necessary
to interpret error bars (Zwick, Zapata-Rivera, & Hegarty, 2014), and knowledge of meteorology
is necessary to make predictions from weather maps (Lowe, 1996). Thus, providing a legend or
graph schema is not sufficient for understanding a visual display.


Applications to the Design of Score Reports 


Educational testing is becoming increasingly important, at least in the United States.
Consequently, there is increasing need for the production and comprehension of score reports. Score
reports come in different forms and are designed for different purposes (see also Tannenbaum,
this volume). One important distinction is between reports of student performance at the
aggregate level (e.g., comparing nations on international assessments such as PISA and TIMSS)
and test reports for individual students; another important distinction is between reports of
formative and summative assessments. With the advent of Cognitive Diagnostic Assessments,
score reports might include information on specific skills that relate to a model of
achievement (Roberts & Gierl, 2010). Test reports are also used by a variety of stakeholders including
students, parents, teachers, administrators, policy makers, and researchers. Not surprisingly,
there are a number of existing papers that examine test reports with respect to principles of
graph design (e.g., Allalouf, 2007; Goodman & Hambleton, 2004; Zenisky & Hambleton, 2012;
Wainer, Hambleton, & Meara, 1999). One general conclusion from this research is that there
is not much standardization in test reports, while another, echoing the theme of this review, is
that what makes an effective score report depends on the type of data and the consumer of the
report. In this final section of this paper, I raise issues regarding the design of score reports with
respect to the types of graphical principles discussed earlier.

First, we can review score reports with respect to principles related to the expressiveness of
displays, and specifically, the principle of appropriate knowledge. The design of score reports needs to
be responsive to the needs of the user. This means that one might not necessarily design the same
score report for teachers and parents, depending on what teachers and parents need to know. It is
important to give various stakeholders the right amount of information. Giving them too much
information, or extraneous information (such as superimposing a graph on a picture), might just
create a cluttered display in which the most relevant information to their needs is not salient.

Turning to principles related to the perception of displays, it is important to ensure that text
and graphic elements are large enough to be accurately perceived and that the graph uses visual
elements that are accurately perceived. For example, scales showing different test scores should
be aligned if it is important to compare a student's relative performance on different subtests
(Cleveland & McGill, 1984). It is also important to consider the perceptual organization of the
display and how different types of graphs more readily communicate different aspects of the
data. For example, as illustrated in Figure 2.3, it is easier to see trends over time in line graphs,
whereas bar graphs might be more appropriate if the most important information is about the
comparison of different groups.

In terms of semantic principles, it is important that the visual variables used to display
various quantities are consistent with natural semantic mappings, such as using higher values on
a graph to display larger scores and using the horizontal dimension (x axis) to show trends
over time. Finally, it is important to match the scale of measurement of a numerical variable to
the visual variable used to depict that variable. For example, color might be a good choice of
visual variable to depict average percent correct on a test for different schools (because school
is a categorical variable in this instance), whereas height of a bar on a bar graph might be more
appropriate for visualizing the average percent scores themselves, because both the scores and
length of a bar are ratio variables.

Finally, score reports need to be designed to follow principles of pragmatics and usability of
information displays. The principle of appropriate knowledge is critical here. People need to
have the necessary knowledge to understand a graph. This can often be provided in a legend
or interpretive guide. However, a legend or guide explaining graphic conventions might not be
sufficient if the person interpreting the score report does not have the basic mathematical or
statistical knowledge necessary to understand what is depicted, or if the guide uses statistical jargon
that the reader is not able to understand (Goodman & Hambleton, 2004). For example, a graph
of scaled scores will not be meaningful to a parent who is not familiar with the scale used. The
same point can be made about error bars showing the confidence interval for an observed score.
For example, Zwick et al. (2014) examined teachers' comprehension of different score reports
that showed confidence intervals either as error bars or as more graded violin plots that showed
variable-width confidence bands (see Figure 2.7). When asked which displays they preferred
and why, some teachers expressed misconceptions about the nature of confidence intervals.
For example, one teacher expressed a preference for the fixed-width confidence bands, stating,
“the bar of equal width gives you a better picture of another score being equally likely to occur
in that range” (p. 135). In sum, a legend or guide to a score report can only do so much. While
researchers have made some inroads to designing displays taking account of the knowledge of
consumers (Zapata-Rivera & Katz, 2014), we need more basic research on what different
stakeholders know about both measurement concepts and graphic conventions.

Figure 2.7 Examples of score reports compared by Zwick et al. (2014).


More generally, the development of effective score reports has to be a process of iterative
design and evaluation, and any score reports that are designed need to be evaluated with the
actual stakeholders (teachers, parents, policy makers, etc.) for whom they are intended. A
number of researchers (Zapata-Rivera, VanWinkle, & Zwick, 2012; Zenisky & Hambleton, 2012)
have proposed frameworks for designing and evaluating score reports that follow this general
principle. This evaluation needs to go beyond considering what types of score reports
people prefer, because preference for a display is often dissociated from ability to understand it
(Smallman & St. John, 2005; Wainer et al., 1999). Another issue is that there seems to be very
little standardization of similar types of test reports across different contexts, for example test
reports in different states (Goodman & Hambleton, 2004). Development of standard displays
and graphic conventions for similar types of score reports would ensure that consumers have to
master fewer basic formats for these reports and are able to transfer their understanding of these
reports more easily across contexts.

Finally, given the increased prevalence of test scores and other information visualizations in
our lives, it is worth considering whether we need better means of educating people about how to
interpret graphic displays and about the measurement concepts that are necessary to interpret
these displays. Recent studies have had some success in educating both student teachers and
working teachers about measurement concepts and score reports (Zapata-Rivera, Zwick, & Vezzu, 2016;
Zwick et al., 2008). Having students interpret their own test score reports might be a “teaching moment”
for educating them about both graphical conventions and measurement concepts.

In conclusion, cognitive scientists have made significant progress in understanding how
people understand information visualizations. Insights from cognitive science have suggested
general principles that can be used to design more effective visualizations of score reports. At
the same time, cognitive science studies suggest that not all current problems in interpretation
of score reports can be solved by display design alone. These studies also suggest that we
need to educate stakeholders to be more knowledgeable about the nature of educational
measurement and graphical conventions, so that they can be better consumers of score reports.


Subscores


When to Communicate Them, What Are Their
Alternatives, and Some Recommendations


Sandip Sinharay, Gautam Puhan, Shelby J. Haberman,
and Ronald K. Hambleton


Subscores are scores based on any meaningful cluster of items on a test. For example, scores on the
algebra and geometry sections on a mathematics test or programming applications and program
design and development sections on a computer science test are often referred to as subscores.
Brennan (2012) stated that users of test scores often want (indeed demand) that subscores be
reported, along with total test scores, due to their potential diagnostic, remedial, and instructional
benefits. According to the National Research Council report “Knowing What Students Know”
(2001), the purpose of assessment is to provide particular information about an examinee's
knowledge, skill, and abilities, and subscores may have the potential to provide such information.
Furthermore, the US Government's No Child Left Behind (NCLB) Act of 2001 and the Every Student
Succeeds Act (ESSA; Every Student Succeeds Act, 2015-2016) require that state assessments
provide more detailed and formative information. In particular, it requires that state assessments
“produce individual student interpretive, descriptive, and diagnostic reports” (p. 26); subscores
might be used in such a diagnostic report. As is evident, there is substantial pressure on testing
programs to report subscores, both at the individual examinee level and at aggregate levels such
as at the level of institutions or states. It is therefore not surprising that subscores are reported by
several large-scale testing programs, such as SAT®, ACT®, Praxis, and LSAT.

The next section provides a review of the literature on how subscores are reported and on
recommendations regarding how they should be reported. The Quality of Subscores section
includes a discussion of how existing subscores often do not have satisfactory psychometric
properties. The Techniques section includes a discussion of several methods for evaluating the
quality of subscores and a review of the existing analyses of subscores with respect to their
quality. Several alternatives to subscores are discussed in the penultimate section. The final section
includes several conclusions and recommendations.


Existing Findings and Recommendations on Communicating Subscores 


Goodman and Hambleton (2004) provided a comprehensive review and critique of score
reporting practices from large-scale assessments. Figures 17-21 of Goodman and Hambleton (2004)
include examples of several operational score reports that include subscores. The subscores in
the score reports examined by them were reported in one of three forms: as number-correct or
raw scores, as percent-correct scores, or as percentile raw scores.

Figure 3.1 of this chapter shows an example of one section of a sample Praxis® score report
(shown in www.ets.org/s/praxis/pdf/sample_score_report.pdf). A total of five subscores, with
maximum possible scores ranging from 15 to 37, are reported for this test, titled Praxis
Elementary Education: Curriculum, Instruction, and Assessment. The average range of subscores
earned by the middle 50% of test takers is reported and can be used to compare how well an
examinee did versus the other test takers who took this test. Although such a report might be
useful to assess examinees' specific strengths and weaknesses in these five sub-content areas,
there is a word of caution that the sample score report provides, which is that these subscores
are based on small numbers of questions and are less reliable than the official scaled scores.
Therefore, they may not be used to inform any decisions affecting examinees without careful
consideration of such inherent limited precision.
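
The comparison that the report invites can be sketched in a few lines. The values below are transcribed from the sample report in Figure 3.1; the helper name `position_vs_range` and the labels it returns are hypothetical, introduced only for illustration, and are not part of the Praxis report.

```python
# Values transcribed from the sample report in Figure 3.1:
# category -> (raw points earned, (low, high) of the middle-50% range)
REPORT = {
    "Reading and Language Arts": (33, (23, 29)),
    "Mathematics": (26, (19, 25)),
    "Science": (15, (11, 15)),
    "Social Studies": (14, (9, 13)),
    "Art, Music, and Physical Education": (13, (6, 12)),
}


def position_vs_range(earned: int, low: int, high: int) -> str:
    """Label a raw score relative to the middle-50% range (hypothetical helper)."""
    if earned < low:
        return "below"
    if earned > high:
        return "above"
    return "within"


labels = {
    category: position_vs_range(earned, low, high)
    for category, (earned, (low, high)) in REPORT.items()
}

for category, label in labels.items():
    print(f"{category}: {label} the middle-50% range")
```

Such labels are descriptive only. As the report's own caution makes clear, category scores rest on small numbers of questions, so a score just above or just inside the range should not be read as a precise statement about the examinee.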

The problems that Goodman and Hambleton (2004) noticed in the score reports that they 
examined include (a) reports that assume a high level of statistical knovvledge on the part of the 
users, (b) use of statistical fargon such as “statistical significance” or “standard error” that con- 
fused and often intimidated the users, (c) misunderstanding or ignoring of technical symbols, 
concepts, and footnotes by the users of score reports, (d) reports that provided too much infor- 
mation that made it difficult for users to extract vhat they needed most, (e) including excessively 
dense graphics and displays that vvas daunting for readers, and (f) lack of descriptive informa- 
tion such as definitions and examples to aid interpretation of the results. Rick and Park (2017) 


Test / Test Category*                             Your Raw Points Earned   Average Performance Range**

ELEMENTARY EDUCATION: CURRICULUM, INSTRUCTION, AND ASSESSMENT (5017)

  I.   READING AND LANGUAGE ARTS                  33 out of 37             23-29
  II.  MATHEMATICS                                26 out of 31             19-25
  III. SCIENCE                                    15 out of 20             11-15
  IV.  SOCIAL STUDIES                             14 out of 17             9-13
  V.   ART, MUSIC, AND PHYSICAL EDUCATION         13 out of 15             6-12

* Category-level information indicates the number of test questions answered correctly for relatively small subsets of the
questions. Because they are based on small numbers of questions, category scores are less reliable than the official scaled
scores, which are based on the full sets of questions. Furthermore, the questions in a category may vary in difficulty
from one test to another. Therefore, the category scores of individuals who have taken different forms of the test are not
necessarily comparable. For these reasons, category scores should not be considered a precise reflection of a candidate's
level of knowledge in that category, and ETS recommends that category information not be used to inform any
decisions affecting candidates without careful consideration of such inherent lack of precision.

** The range of scores earned by the middle 50% of a group of test takers who took this form of the test at the most recent
national administration or other comparable time period. N/C means that this range was not computed because fewer
than 30 test takers took this form of the test or because there were fewer than eight questions in the category or, for a
constructed-response module, fewer than eight points to be awarded by the raters. N/A indicates that this test section
was not taken and, therefore, the information is not applicable.


Figure 3.1 Section of a Praxis® test score report that includes subscores.

sought to extend the findings of Goodman and Hambleton (2004) by conducting a similar
investigation using score reports from 23 US states and one US territory. They used the findings of
Goodman and Hambleton as a basis for comparison and identified score reporting practices
that have remained mostly unchanged over the last decade or so. Examples of such practices are
presenting overall results both graphically and numerically and having score reports that are still
about two pages in length. Rick and Park also identified practices that changed more visibly, such
as providing non-numerical or descriptive performance feedback, using more color in score
reports, and giving more details about the precision of overall scores. Finally, they identified new
practices that were not commonly observed before, such as contextualizing results in terms of
“college and career readiness.” For more information on these and other related score reporting
practices, see Hambleton and Zenisky (2013) and Zenisky and Hambleton (2012, 2016).

As interest in score reporting continues to increase, researchers have been trying to find better
ways to communicate test results through well-designed score reports. Zapata-Rivera, van Winkle,
and Zwick (2012) suggested a framework for designing and evaluating score reports. The framework
was based on methodologies used in assessment design (e.g., Mislevy, Steinberg, &
Almond, 2003), software engineering (e.g., Pressman, 2005), and human-computer interaction
(e.g., Nielsen, 1994), and included the following steps: (a) gathering assessment information needs,
(b) reconciling these needs with the available assessment information, (c) designing score report
prototypes, and (d) evaluating these score report prototypes with internal and external experts.
Tannenbaum (this volume) provided a seven-step model to develop score reports, which he claims
are integral to assuring the validity and utility of test score interpretation. Similarly, Slater,
Livingston, and Silver (this volume) described a step-by-step process for designing good score reports.
This includes (a) gathering information about the score report needed, (b) creating a schedule
for the score report design, (c) beginning graphic designs, (d) getting clients' reactions to initial
designs, (e) gathering feedback from intended users of the score report, (f) revising the design
based on information from end users, and (g) finalizing the design. These processes can be used
to design score reports not only for overall test scores but also for subscores.

Roberts and Gierl (2010) noted that integration and application of interdisciplinary
techniques from education, information design (e.g., Pettersson, 2002), and technology are required
for effective score reporting. They also presented a structured approach for developing score
reports for cognitive diagnostic assessments. They provided guidelines for reporting and
presenting diagnostic scores based on a review of educational score reporting practices and
literature from the area of information design, and presented a sample diagnostic report to illustrate
application of their approach. Because subscores constitute a type of diagnostic score, several
recommendations of Roberts and Gierl apply to subscore reporting as well. For example, their
sample diagnostic report included three sections: (a) a top section containing an overview of the
contents of the report, (b) a middle section containing diagnostic information along with
item-level performance, and (c) a bottom section containing a narrative summary of the examinee's
performance across all the subareas. A score report that is supposed to include information on
subscores could consist of sections similar to those in Roberts and Gierl (2010).


On the Psychometric Quality of Subscores 


Despite the demand for and apparent usefulness of subscores, they have to satisfy certain
quality standards in order to be reported. According to Haberman (2008), a subscore
may be reported if it has high reliability and is distinct from the other subscores. Similarly,
Tate (2004) emphasized the importance of ensuring reasonable subscore performance
in terms of high reliability and validity to minimize incorrect instructional and remediation
decisions. These concerns are in agreement with Standards 1.14 and 2.3 of the Standards for
Educational and Psychological Testing (2014), which require proof of adequate reliability, validity,
and distinctness of subscores. Monaghan (2006) pointed out that “While they want to be
responsive to the desires of the educational marketplace, testing organizations are also very
concerned about the appropriate use and interpretation of subscores.” Just as inaccurate
information at the total test score level may lead to decisions with damaging consequences (e.g.,
wrongly certifying someone to be a teacher or medical practitioner), inaccurate information
at the subscore level can also lead to incorrect remedial decisions resulting in large and
needless expenses for examinees, states, or institutions (Sinharay, Puhan, & Haberman, 2011).

Very low reliabilities of subscores on educational tests are common because these subscores
are often based on only a few items. For example, the state test score report shown in Figure 20
of Goodman and Hambleton (2004) includes seven subscores that are based on only five items
each. Such subscores are most often outcomes of retrofitting, where reporting subscores was not
a primary goal of the test but subscores were later provided to comply with clients' requests for
more diagnostic information on examinees. Furthermore, as Sinharay and Haberman (2008a) pointed
out, real data may not provide information as fine-grained as suggested or hoped for by the
assessment specialist. A theory of response processes based on cognitive psychology may suggest
several skills, but a test includes a limited number of items and may not have enough items to provide
adequate information about all of these skills. For example, for the iSkills™ test (e.g., Katz, Attali,
Rijmen, & Williamson, 2008), an expert committee identified seven performance areas that they
thought comprised Information and Communications Technology literacy skill. However, a factor
analysis of the data revealed only one factor, and confirmatory factor models in which the factors
corresponded to performance areas or a combination thereof did not fit the data (Katz et al., 2008).
As a result, only an overall Information and Communications Technology literacy score was
reported for the test until the test was discontinued in 2017. Clearly, an investigator attempting to
report subscores has to make an informed judgment on how much evidence the data can reliably
provide and report only as much information as is reliably supported.

To demonstrate how low reliability can lead to inaccurate diagnostic information, Sinharay,
Puhan, and Haberman (2010) considered the Praxis Elementary Education: Content Knowledge
test. The 120 multiple-choice questions focus on four major subject areas: language arts/
reading, mathematics, social studies, and science. There are 30 questions per area and a subscore
is reported for each of these areas. The subscore reliabilities are between 0.71 and 0.83. The
authors ranked the questions on mathematics and science separately in the order of difficulty
(proportion correct) and then created a Form A that consists of the questions ranked 1, 4, 7, ...,
28 in mathematics and the questions ranked 1, 4, 7, ..., 28 in science. Similarly, Form B was
created with questions ranked 2, 5, 8, ..., 29 in mathematics and in science, and a Form C with
the remaining questions. Forms A, B, and C can be considered roughly parallel forms and, by
construction, all of the several thousand examinees took all three of these forms (see Figure 3.2
below for illustration). The subscore reliabilities on these forms range between 0.46 and 0.60.

The authors considered all the 271 examinees who obtained a subscore of 7 on mathematics
and 3 on science on Form A. Such examinees will most likely be thought to be strong on
mathematics and weak on science and given additional science lessons. Is that justified? The authors
examined the mathematics and science subscores of the same examinees on Forms B and C.
They found that:


• The percent of the 271 examinees with a mathematics subscore of 5 or lower is 34 and 39,
  respectively, for Forms B and C.

• The percent with a science subscore of 6 or higher is 39 and 32, respectively, for Forms B and C.

• The percent of examinees whose mathematics score is higher than their science score is
  only 59 and 66, respectively, for Forms B and C.

Figure 3.2 Graph showing the partition of the three subforms from the original math and science forms.


This simple example demonstrates that remedial and instructional decisions based on short
subtests will often be inaccurate. If subscores are used as a diagnostic tool, the stakes
involved are lower than for scores that are used for high-stakes purposes such as
certification. Nevertheless, subscores should be at least moderately reliable (e.g., 0.80 or higher)
to be of any remedial use. Otherwise, users would mostly be chasing noise and making incorrect
remedial decisions.
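
The construction of Forms A, B, and C in the example above can be sketched in a few lines of Python. The sketch also applies the Spearman-Brown formula, which the text does not invoke; it is used here only as a rough check, under the assumption that a shortened form is parallel to the full subtest, on how much reliability is lost when a 30-item subtest is cut to 10 items. The function names are hypothetical.

```python
def split_into_forms(ranked_items, n_forms=3):
    """Deal items, already ranked by difficulty, into forms in rotation.

    Ranks 1, 4, 7, ... go to the first form, ranks 2, 5, 8, ... to the
    second, and the remaining ranks to the third.
    """
    return [ranked_items[start::n_forms] for start in range(n_forms)]


def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor.

    Assumes the shortened (or lengthened) test is parallel to the original.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Difficulty ranks 1..30 for the 30 questions in one subject area.
ranks = list(range(1, 31))
form_a, form_b, form_c = split_into_forms(ranks)
print(form_a)  # ranks 1, 4, 7, ..., 28
print(form_b)  # ranks 2, 5, 8, ..., 29
print(form_c)  # the remaining ranks 3, 6, 9, ..., 30

# Each form has one third of the items of the full 30-item subtest.
for full_length_reliability in (0.71, 0.83):
    projected = spearman_brown(full_length_reliability, 1 / 3)
    print(full_length_reliability, round(projected, 2))
```

Under these assumptions, full-length reliabilities of 0.71 and 0.83 project to roughly 0.45 and 0.62 for a form one third as long, which is in the neighborhood of the 0.46 to 0.60 reported for Forms A, B, and C. The text does not say which of the full-length reliabilities belong to mathematics and science, so the comparison is only indicative.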

Several statistical techniques, ranging from fairly basic operations such as computing
correlations and reliabilities to more sophisticated applications such as factor analysis
and dimensionality analysis, have been used to assess whether subscores are worth
reporting. Some of these techniques are summarized in the next section.


Techniques to Evaluate When Subscores Are Worth Reporting
Use of Correlation or Reliability of Subscores 


Researchers and practitioners often use simple rules to determine if subscores are worth
reporting. Several researchers have used the correlations corrected for attenuation between the
different subscores to decide whether it is reasonable to report subscores. If the disattenuated
correlations among subscores are fairly high, then the subscores are essentially
not distinct from each other and therefore not worth reporting. For example, McPeek,
Altman, Wallmark, and Wingersky (1976) and Haladyna and Kramer (2004) used the criterion
that reporting of subscores is not warranted if the correlations corrected for attenuation among
them are larger than 0.90. Similarly, researchers often judge the subscores to be useful if their
reliabilities are sufficiently large. For example, Wainer et al. (2001, p. 363) commented that the
subscores for three of the four sections of the item tryout administration of the North Carolina
Test of Computer Skills are insufficiently reliable to allow individual student reporting.
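
A minimal sketch of the correction for attenuation and of the 0.90 criterion follows. The correction divides the observed correlation between two subscores by the square root of the product of their reliabilities. The function names and the numerical inputs are hypothetical and are not taken from any of the studies cited.

```python
import math


def disattenuated_correlation(observed_r, reliability_x, reliability_y):
    """Correlation corrected for attenuation due to measurement error."""
    return observed_r / math.sqrt(reliability_x * reliability_y)


def too_similar_to_report(observed_r, reliability_x, reliability_y, cutoff=0.90):
    """Apply the rule that subscores correlating above the cutoff, after
    correction for attenuation, are not distinct enough to report."""
    return disattenuated_correlation(observed_r, reliability_x, reliability_y) > cutoff


# Hypothetical inputs: observed correlation 0.70, reliabilities 0.75 and 0.80.
corrected = disattenuated_correlation(0.70, 0.75, 0.80)
print(round(corrected, 3))                      # approximately 0.904
print(too_similar_to_report(0.70, 0.75, 0.80))  # True under the 0.90 cutoff
```

Because estimated reliabilities are themselves subject to sampling error, a corrected correlation computed this way can exceed 1 in practice.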


Application of Principal Component Analysis and Factor Analysis 


A simple approach to evaluate whether the subscores are distinct enough would be to compute
the eigenvalues from the correlation matrix of the subscores (or from the correlation matrix
of the items) in a principal component analysis. If most of the eigenvalues computed from the
correlation matrix of the subscores are smaller than 1 or if a scree plot of these eigenvalues
shows that the eigenvalues abruptly level out at some point, then the claim of several distinct
subscores is probably not justified. On the contrary, the presence of multiple large eigenvalues
would support the reporting of subscores. For example, Sinharay, Haberman, and Puhan (2007)
computed the eigenvalues from the 6 × 6 correlation matrix of six reported subscores from two
forms of a test for paraprofessionals. They found that the largest eigenvalue was 4.3 for both
forms while the remaining five eigenvalues were smaller than 0.5, suggesting that the test is
essentially unidimensional and the claim of six distinct subscores is probably not justified.
Similarly, Stone, Ye, Zhu, and Lane (2010) reported, using an exploratory factor analysis method
on the inter-item correlation matrix, the presence of only one factor in the Eighth Grade
Mathematics portion of the Spring 2006 assessment of the Delaware State Testing Program; hence,
subscores were not worth reporting for the assessment.
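
The eigenvalue check described above can be sketched with NumPy. The correlation matrix below is hypothetical: six subscores that all correlate 0.66 with one another, chosen only because it reproduces the pattern reported in the example (one eigenvalue of 4.3 and five below 0.5). It is not the matrix from Sinharay, Haberman, and Puhan (2007).

```python
import numpy as np


def subscore_eigenvalues(corr):
    """Eigenvalues of a symmetric correlation matrix, largest first."""
    corr = np.asarray(corr, dtype=float)
    return np.linalg.eigvalsh(corr)[::-1]


# Hypothetical 6 x 6 correlation matrix with every off-diagonal entry 0.66.
n_subscores, rho = 6, 0.66
corr = np.full((n_subscores, n_subscores), rho)
np.fill_diagonal(corr, 1.0)

eigenvalues = subscore_eigenvalues(corr)
n_large = int(np.sum(eigenvalues > 1.0))

print(np.round(eigenvalues, 2))  # one dominant eigenvalue, the rest small
print(n_large)                   # number of eigenvalues larger than 1
```

A single eigenvalue above 1 followed by uniformly small ones is the pattern that the text reads as evidence of an essentially unidimensional test.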


Application of the Beta-Binomial Model 


Harris and Hanson (1991) suggested fitting a mathematical model named the beta-binomial
model (Lord, 1965) to the observed subscore distributions to determine if the subscores have
added value over and beyond the total score. Consider a test with
two subscores. If the bivariate distribution of the two subscores computed under the assumption
that the corresponding true subscores are functionally related provides an adequate fit to
the observed bivariate distribution of subscores, then the true subscores can be regarded as
functionally related and the subscores therefore do not provide any added value. Harris and Hanson
used a chi-square-type statistic in their data example to determine the goodness of fit of the
bivariate distribution of the two subscores, computed under the assumption that the corresponding
true subscores are functionally related, to the observed bivariate distribution of subscores.
However, as pointed out by Sinharay, Puhan, and Haberman (2010), the method of Harris and Hanson
involves significance testing with a chi-square statistic whose null distribution is not well established.


Fitting of Multidimensional Item Response Theory Models 


Another way to examine if subscores have added value is to fit a multidimensional item response
theory (MIRT) model (e.g., Reckase, 1997; Ackerman, Gierl, & Walker, 2003) to the data. MIRT
is a tool to model examinee responses to a test that measures more than one ability, for example,
a test that measures both mathematical and verbal ability. In MIRT, the probability of a correct
item response is a function of several abilities, rather than a single measure of ability. To
examine if subscores have added value, one can perform a statistical test of whether a MIRT model
provides a better fit to the data than a unidimensional IRT model. See von Davier (2008) for a
demonstration of this sort of use of MIRT.
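
One common member of the MIRT family, given here only as an illustration and not necessarily the model used in the works cited above, is the compensatory multidimensional two-parameter logistic model. With $\boldsymbol{\theta}_i$ the vector of abilities of examinee $i$, $\mathbf{a}_j$ a vector of discrimination parameters for item $j$, and $d_j$ an intercept (all notation assumed for this sketch), the probability of a correct response is

\[
P(X_{ij} = 1 \mid \boldsymbol{\theta}_i)
  = \frac{\exp\left(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j\right)}
         {1 + \exp\left(\mathbf{a}_j^{\top}\boldsymbol{\theta}_i + d_j\right)} .
\]

When $\boldsymbol{\theta}_i$ has a single component, this reduces to a unidimensional two-parameter logistic IRT model, which is the kind of simpler model against which the fit of the MIRT model would be compared.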


Using Dimensionality Assessment Software Such as DIMTEST and DETECT


The DETECT software (Zhang & Stout, 1999) uses an algorithm that searches through all of
the possible partitions of items into clusters to find the one that maximizes the DETECT
statistic. Based on results from a simulation study, Kim (1994) provided guidelines to interpret the
DETECT statistic. According to these guidelines, if the DETECT statistic is less than 0.10, then
the data can be considered unidimensional. Values between 0.10 and 0.50 would indicate
a weak amount of dimensionality, values between 0.51 and 1.00 would indicate a moderate
amount of dimensionality, and values higher than 1.00 would indicate a high level of
multidimensionality. The DIMTEST software (Stout, 1987) implements a hypothesis testing procedure to
evaluate the lack of unidimensionality in data from a test. It assesses the statistical significance
of the possible dimensional distinctiveness between two specified subtests (the Assessment
Subtest or AT and the Partitioning Subtest or PT). The test statistic T calculated by DIMTEST
represents the degree of dimensional distinctiveness between these two specified subtests. For
example, if the testing practitioner wants to know if a test of general ability has dimensions
such as mathematics or reading that are dimensionally distinct from the rest of the items
in the test, then the mathematics (or reading) items can form the assessment subtest and the
remaining items can form the partitioning subtest. A significant value of the DIMTEST
statistic T would then indicate that the two subtests are dimensionally distinct.
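
The interpretation guidelines for the DETECT statistic quoted above amount to a simple lookup, sketched below. The function name and labels are hypothetical. The guidelines as stated leave a gap between 0.50 and 0.51, so the sketch assumes that values up to and including 0.50 count as weak and values up to and including 1.00 count as moderate.

```python
def interpret_detect(value):
    """Classify a DETECT statistic using the guidelines attributed to Kim (1994).

    Boundary handling (<= 0.50 weak, <= 1.00 moderate) is an assumption,
    because the guidelines are stated with two-decimal ranges.
    """
    if value < 0.10:
        return "unidimensional"
    if value <= 0.50:
        return "weak multidimensionality"
    if value <= 1.00:
        return "moderate multidimensionality"
    return "high multidimensionality"


# Hypothetical DETECT values, one per category.
for detect_value in (0.05, 0.30, 0.75, 1.40):
    print(detect_value, interpret_detect(detect_value))
```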


Application of Haberman's Classical Test Theory-Based Method


Haberman (2008) suggested a method based on classical test theory to determine if a subscore
has added value. According to the method, a subscore has added value only if it can be predicted
better by the corresponding subscore on a parallel form than by the total score on the parallel
form (Sinharay, 2013). To apply the method, one examines whether the reliability of a subscore
is larger than another reliability-like measure referred to as the proportional reduction in mean
squared error (PRMSE) of the total score. Like the reliability coefficient, the PRMSE typically
ranges from 0 to 1, with 0 and 1 indicating the lowest and highest degrees of trustworthiness,
respectively. Using the Haberman method, a subscore is said to have added value and may be
considered worth reporting only when the subscore reliability is larger than the PRMSE of the
total score, which happens if and only if the observed subscore predicts the true subscore more
accurately than does the observed total score. Sinharay, Haberman, and Puhan (2007) discussed
why the strategy suggested by Haberman is reasonable and how it ensures that a subscore
satisfies professional standards. Conceptually, if the subscore is highly correlated with the total
score (i.e., the subscore and the total score measure the same basic underlying skill), then the
subscore does not provide any added value over what is already provided by the total score.
A subscore is more likely to have added value if it has high reliability and if it is distinct from the
other subscores. Applications of the Haberman method can be found in Lyren (2009), Puhan,
Sinharay, Haberman, and Larkin (2010), Meijer, Boevé, Tendeiro, Bosker, and Albers (2017),
Sinharay (2010), Sinharay, Puhan, and Haberman (2010), and Sinharay, Puhan, and Haberman
(2011). For example, Sinharay, Puhan, and Haberman (2011) analyzed data from a teacher
certification test with four subscores. They found the values of the reliabilities of the subscores to
be 0.71, 0.85, 0.67, and 0.72 for the reading, mathematics, social studies, and science subscores,
respectively. The corresponding values of the PRMSE for the total score are 0.72, 0.76, 0.74, and 0.79,
respectively. Thus, only the mathematics subscore of the test had added value. Similar results
were also reported by Puhan, Sinharay, Haberman, and Larkin (2010) for several other teacher
certification tests.
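
The decision rule in the example above can be written out directly. The sketch below takes the reliabilities and PRMSE values quoted in the text as given; it does not show how the PRMSE of the total score is estimated, and the function and variable names are hypothetical.

```python
def has_added_value(subscore_reliability, prmse_total):
    """Haberman's criterion: a subscore has added value only when its
    reliability exceeds the PRMSE of the total score."""
    return subscore_reliability > prmse_total


# Values quoted in the text for the teacher certification test.
reliability = {
    "reading": 0.71,
    "mathematics": 0.85,
    "social studies": 0.67,
    "science": 0.72,
}
prmse_total = {
    "reading": 0.72,
    "mathematics": 0.76,
    "social studies": 0.74,
    "science": 0.79,
}

worth_reporting = [
    name for name in reliability
    if has_added_value(reliability[name], prmse_total[name])
]
print(worth_reporting)  # ['mathematics']
```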


How Often Do Subscores Have Adequate Psychometric Quality?


In the previous section, we presented methods that can be used to examine the psychometric
quality of subscores to determine if they are worth reporting. In this section, we will present
research that shows how often subscores have adequate psychometric quality to be considered
worth reporting. Both actual and simulated data were used in these studies.

Sinharay (2010) performed an extensive survey regarding whether subscores have added
value over the total score using the Haberman (2008) method. He used data from 25 large-scale
operational tests (e.g., P-ACT+ English, P-ACT+ Math, SAT Verbal, SAT Math, SAT, SweSAT,
MFT Business), several of which reported subscores operationally. Of the 25 tests, 16 had
no subscores with added value, and among the remaining nine tests, all of which had an
average of at least 24 items contributing to each subscore, only some of the subscores had added value.
Sinharay also performed a detailed simulation study to find out when subscores can be expected
to have added value. The simulation study showed that in order to have added value, subscores
have to be based on a sufficient number of items (roughly 20) and be sufficiently distinct from
one another: the disattenuated correlation between subscores has to be less than about 0.85.

Stone, Ye, Zhu, and Lane (2010) used an exploratory factor analysis method on the inter-item
correlation matrix to report the presence of only one dominant factor for the 2006 eighth-grade
Delaware Math Assessment even though the test blueprint called for four content domains.
Harris and Hanson (1991), using their method of fitting beta-binomial distributions to the
observed subscore distributions, found subscores to have little added value for the English and
mathematics tests from the P-ACT+ examination. Wainer et al. (2001) performed factor
analysis and an examination of the reliability of the subscores on data from one administration of
the Just-in-Time Examination conducted by the American Production and Inventory Control
Society (APICS) certification. They concluded that the six subscales in the APICS examination
did not appear to measure different dimensions of individual differences. However, they found
that a tryout form of the North Carolina Test of Computer Skills that has four subscales was not as
unidimensional as the APICS examination, and an application of the method of Haberman
(2008) to the data reveals that three of the four subscores have added value over the total score.
Wainer, Sheehan, and Wang (2000) considered the problem of constructing skills-based
subscores for the Education in the Elementary School Assessment that is designed for prospective
teachers of children in primary grades (K-3) or upper-elementary/middle-school grades (4-8).
They concluded, mostly from an analysis of reliability of the subscores, that the test's items were
essentially unidimensional and, therefore, skill-based subscores could not be supported.
Ackerman and Shu (2009), using DIMTEST and DETECT, also found subscores not to be useful for a
fifth-grade end-of-grade assessment.

Another study, conducted by Sinharay and Haberman (2009), tried to answer the question “Is
it possible to inform low-scoring test takers about the subareas in which those with similar total
scores have been weak?” The assumption is that low-scoring examinees (i.e., repeaters) may
perform differently on different subscores and therefore reporting subscores might be beneficial
to them for planning remediation. The study tested this assumption using data from three forms
of an elementary education test for prospective teachers. The authors divided the examinees into 20
groups ranked on the basis of their scaled scores and computed, for each examinee group, the
average values of the four subscores on the test. A line joining the four average subscores was
drawn for each group. Sinharay and Haberman did not observe any noticeable difference in the
pattern of the average subscores in the different groups. The line for each examinee group was
roughly parallel to a horizontal line, which means that the test takers who obtain low scaled
scores on the test perform almost equally poorly in all the subject areas on average. The authors
concluded that it is not justified to share with low-scoring test takers the subareas where those
with similar scores have been historically weak, simply because no such subareas exist. These
studies collectively indicate that subscores on operational tests have more often been found not
to be useful than to be useful, and that such subscores do not satisfy professional standards.


Alternatives to Simple Subscores


The studies in the previous section show that subscores on many high-stakes operational tests
are probably not worth reporting because they are either unreliable or not distinct from the
other subscores or a combination of both. One may wonder whether there exist ways to improve
the psychometric properties of subscores so that they are worth reporting. Some psychometric
techniques have been developed that can be used to enhance the reliability of subscores. But the
usefulness of these techniques depends partially on the current state of the subscores. We group
the current state of the subscores under the following three categories:

1. If an observed subscore is reliable and dimensionally distinct from the remaining
   subscores, then it may seem reasonable to simply report the observed subscores. A statistical
   technique to enhance the psychometric quality of these subscores may not be necessary.

2. If an observed subscore is unreliable and not dimensionally distinct from the remaining
   subscores, then it may not be justified to report the observed subscores or even a
   statistically enhanced subscore because a statistical technique cannot be expected to make
   up something that is simply not there. An example is the battery of tests for
   measuring examinee and school progress that was considered in Sinharay and Haberman
   (2008b), in which several correlations among the subscores were found to be larger than 1 after
   correction for attenuation.

3. Statistical techniques to improve the reliability of the subscores may have some utility if the
   observed subscore is moderately reliable and is moderately correlated with the remaining
   subscores. For example, for three teacher licensing tests considered in Puhan et al. (2010),
   no subscore was worth reporting according to the PRMSE criterion of Haberman (2008),
   but all augmented subscores (described shortly) had improved reliability and were worth
   reporting.

In the next section we describe some techniques that have been suggested to
improve the reliability of subscores.


Augmented Subscores and Weighted Averages


Wainer et al. (2000) suggested an approach to increase the precision of a subscore by borrowing
information from other subscores. The approach leads to “augmented subscores” that are linear
combinations of all subscores. Because subscores are almost always found to correlate at least
moderately, it is reasonable to assume that, for example, the science subscore of a student has some
information about the math subscore of the same student. In this approach, weights (or regression
coefficients) are assigned to each of the subscores and an examinee's augmented score on a
particular subscale (e.g., math) would be a function of that examinee's ability on math and that person's
ability on the remaining subscales (e.g., science, reading, etc.). The subscales that have the strongest
correlation with the math subscale have larger weights and thus provide more information on the
“augmented” math subscore. Haberman (2008) suggested a weighted average that is a linear
combination of a subscore and the total score where weights for the subscore and total score depend
on the reliabilities and standard deviations of the subscore and the total score and the correlations
between the subscores. Sinharay (2010) showed that the augmented subscores and weighted
averages often are substantially more reliable than the subscores themselves. A possible limitation of
augmented subscores or weighted averages is that it is hard to explain to users exactly what these
scores mean (although some progress on this front has been made by Sinharay, 2018). For
example, it may be difficult to explain to an examinee why her reported math score is based not only
on her observed math score but also on her observed scores on other subsections such as reading
and writing. Also, the test organizers and test-score users may not like the idea of borrowing from
other scores. In addition, researchers such as Stone et al. (2010) raised the concern that augmented
subscores and weighted averages may hide differences between the subscores by forcing the
different augmented subscores of examinees to appear similar to each other. The similarity between the
different augmented subscores of an examinee is a price to pay for their greater accuracy.
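
The idea of borrowing information across subscores can be sketched as a best linear predictor of the true subscores from the observed ones. The sketch below is a simplified illustration under classical test theory assumptions (measurement errors uncorrelated across subscores, known means, covariances, and reliabilities). It is not presented as the exact procedure of the authors cited above, and all names and numbers are hypothetical.

```python
import numpy as np


def augmented_subscores(observed, mean, cov_observed, reliabilities):
    """Best linear predictor of the true subscores from all observed subscores.

    observed      : observed subscores of one examinee
    mean          : population means of the subscores
    cov_observed  : covariance matrix of the observed subscores
    reliabilities : reliability of each subscore
    """
    observed = np.asarray(observed, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov_observed = np.asarray(cov_observed, dtype=float)
    reliabilities = np.asarray(reliabilities, dtype=float)

    # Error variance of each subscore is its variance times (1 - reliability).
    error_var = np.diag(cov_observed) * (1.0 - reliabilities)
    # Errors are assumed uncorrelated, so only the diagonal changes.
    cov_true = cov_observed - np.diag(error_var)

    # Weight matrix cov_true @ inv(cov_observed); both matrices are symmetric,
    # so the transpose of solve(cov_observed, cov_true) gives the same result.
    weights = np.linalg.solve(cov_observed, cov_true).T
    return mean + weights @ (observed - mean)


# Hypothetical standardized math and science subscores that correlate 0.6,
# each with reliability 0.7.
cov = [[1.0, 0.6], [0.6, 1.0]]
augmented = augmented_subscores([1.0, -1.0], [0.0, 0.0], cov, [0.7, 0.7])
print(np.round(augmented, 2))  # both values are pulled toward each other
```

In this hypothetical case the observed subscores of 1 and -1 become augmented subscores of 0.25 and -0.25, which illustrates the concern that augmentation makes an examinee's subscores look more alike.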


Objective Performance Index


The objective performance index or OPI (Yen, 1987) is another approach to enhancing a
subscore by borrowing information from other parts of the test. This approach uses a combination
of item response theory (IRT) and Bayesian methodology. The OPI is a weighted average of two
estimates of performance: (a) the observed subscore and (b) an estimate, obtained using a
unidimensional IRT model, of the subscore based on the examinee's overall test performance. If the
observed and estimated subscores differ significantly, then the OPI is defined as the observed
subscore expressed as a percentage. One limitation of this approach is that because of the use of
a unidimensional IRT model, it may not provide accurate results when the data are truly
multidimensional, which is when subscores can be expected to have added value.


Estimated Skill Parameters From a Cognitive Diagnostic Model 


It is possible to employ a psychometric model such as a cognitive diagnostic model (CDM; e.g.,
Fu & Li, 2007; Roberts & Gierl, 2010) or a diagnostic classification model (Rupp, Templin, &
Henson, 2010) to report diagnostic scores instead of reporting subscores. These models assume
that (a) solving each test item requires one or more skills, (b) each examinee has a discrete
latent skill parameter corresponding to each of the skills, and (c) the probability that an
examinee will answer an item correctly is a mathematical function of the skills the item requires
and the latent skill parameters of the examinee. For example, a reading test may require skills
such as remembering details (skill 1), knowing fact from opinion (skill 2), and speculating from
contextual clues (skill 3) (McGlohen & Chang, 2008), and the probability that an examinee will
answer a certain item on the reading test correctly is determined by the skills that item requires
and whether the examinee has those skills. After a diagnostic classification model is fit to a
data set, the estimated values of the skill parameters are the diagnostic scores that can be
reported. Examples of such models are the rule space model (RSM; Tatsuoka, 1983), the attribute
hierarchy method (AHM; Leighton, Gierl, & Hunka, 2004), the DINA and NIDA models (Junker &
Sijtsma, 2001), the general diagnostic model (von Davier, 2008), and the reparametrized unified
model (Roussos et al., 2007). The first two of these, the RSM and AHM, are slightly different
in nature from the other diagnostic classification models because they do not estimate any skill
parameters; instead, they match the response pattern of each examinee to several ideal or expected
response patterns to determine what skills the examinee possesses. While there has been
substantial research on diagnostic classification models, as Rupp and Templin (2009) acknowledge,
there has not been a very convincing case that unequivocally illustrates how the added
parametric complexity of these models, compared to simpler measurement models, can be justified
in practice. In addition, there have been few empirical illustrations that the diagnostic scores
produced by these models are reliable and valid (see, e.g., Haberman & von Davier, 2007;
Sinharay & Haberman, 2008a). Nonetheless, researchers have continued interest in CDMs and, as
mentioned earlier, Roberts and Gierl (2010) provided guidelines for presenting and reporting
diagnostic scores for assessments that employ CDMs; several of those guidelines are
applicable to reporting of subscores.
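
To make assumption (c) concrete, the following minimal sketch computes the probability of a correct response under the DINA model, one of the models listed above. The slip and guess values, the item's skill requirements, and the function name are hypothetical and chosen only for illustration; operational estimation of skill parameters is considerably more involved.

```python
def dina_prob(alpha, q, slip, guess):
    """Probability of a correct answer under the DINA model.

    alpha : 0/1 list, skills the examinee has mastered
    q     : 0/1 list, skills the item requires (one row of a Q-matrix)
    slip  : chance of answering incorrectly despite having every required skill
    guess : chance of answering correctly while lacking a required skill
    """
    # The examinee "masters" the item only if every required skill is present.
    has_all_required = all(a == 1 for a, r in zip(alpha, q) if r == 1)
    return 1 - slip if has_all_required else guess


# Hypothetical reading item requiring skill 1 (details) and skill 3 (contextual clues).
q_item = [1, 0, 1]
print(dina_prob([1, 1, 1], q_item, slip=0.1, guess=0.2))  # has all required skills
print(dina_prob([1, 1, 0], q_item, slip=0.1, guess=0.2))  # lacks skill 3
```

Under these illustrative values, an examinee with all required skills answers correctly with probability 1 - slip, and any examinee missing a required skill answers correctly only with probability guess, regardless of which other skills are held.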


Estimates From a MIRT Model 


Several researchers, such as Luecht (2003), Yao and Boughton (2007), Yao (2010), and Haberman
and Sinharay (2010), have examined the use of MIRT models (Reckase, 1997) to report
subscores. Yao (2010) showed that MIRT models can be used to report a set of reliable overall as
well as domain scores. Haberman and Sinharay (2010) also evaluated when subscores computed
using a MIRT model have any added value over the total score or over subscores based on
classical test theory and found that there is not much difference between MIRT-based subscores
and augmented subscores (Wainer et al., 2001). Haberman and Sinharay also suggested reporting
estimated true subscores, computed using MIRT models, that are on the same scale as the
number-correct subscores.


Scale Anchoring


Scale anchoring (e.g., Beaton & Allen, 1992; Hambleton, Sireci, & Huff, 2008) makes claims
about what students at different score points know and can do and is an approach that can
be used to report more information than the total score when subscores are not of adequate
psychometric quality. Scale anchoring typically is carried out by (a) selecting a few dispersed
points on the score scale (anchor points) that will be anchored, (b) finding examinees who score
near each anchor point, (c) examining each item to see if it discriminates between successive
anchor points, that is, if most of the students at the higher score level can answer it correctly
and most of the students at the lower level cannot, and (d) reviewing the items that discriminate
between adjacent anchor points to find out if specific tasks or attributes that they include
can be generalized to describe the level of proficiency at the anchor point (e.g., Phillips et al.,
1993). The outcome from this review is a description of what students at various scale points
know and can do (see, for example, Hambleton & Zenisky, 2018). Although scale anchoring
seems promising, there can be confusion about the meaning of data related to score anchors,
and care must be used in offering correct anchor score interpretations (Linn & Dunbar, 1992).
Phillips et al. (1993) also described the danger of over-interpreting examinee performance at
anchor points so that all examinees at a particular level are assumed to be proficient at all
abilities measured at that level. Sinharay, Haberman, and Lee (2011) described statistical
procedures that can be used to determine if scale anchoring is likely to be successful for a test.
They used several data sets from a teacher certification program and concluded that scale
anchoring is not expected to provide much useful information to the examinees for this series of
examinations. Although the discouraging results for the data they considered do not necessarily
imply that the same results will always be observed, they do indicate that success in scale
anchoring is far from guaranteed.
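
Steps (b) and (c) of the anchoring procedure can be sketched as follows. The thresholds used to operationalize "most" (at least 65% correct at the higher anchor point and under 50% at the lower one), the band width around each anchor point, and the toy data are all assumptions made for illustration; they are not taken from the studies cited above.

```python
def proportion_correct_near(scores, responses, anchor, band):
    """Per-item proportion correct among examinees scoring within `band` of `anchor`."""
    near = [resp for s, resp in zip(scores, responses) if abs(s - anchor) <= band]
    if not near:
        raise ValueError(f"no examinees score near anchor point {anchor}")
    n_items = len(near[0])
    return [sum(resp[j] for resp in near) / len(near) for j in range(n_items)]


def discriminating_items(scores, responses, lower, upper, band, hi=0.65, lo=0.50):
    """Indices of items that most examinees near `upper` answer correctly
    and most examinees near `lower` do not (hypothetical thresholds)."""
    p_lower = proportion_correct_near(scores, responses, lower, band)
    p_upper = proportion_correct_near(scores, responses, upper, band)
    return [j for j, (pl, pu) in enumerate(zip(p_lower, p_upper))
            if pu >= hi and pl < lo]


# Toy data: six examinees, three items (1 = correct, 0 = incorrect).
scores = [10, 11, 10, 20, 21, 20]
responses = [
    [1, 0, 0], [1, 0, 1], [1, 0, 0],   # examinees near the lower anchor point
    [1, 1, 0], [1, 1, 1], [1, 1, 0],   # examinees near the upper anchor point
]
print(discriminating_items(scores, responses, lower=10, upper=20, band=1))
```

Only the second item (index 1) separates the two anchor points in this toy example; the first is easy at both levels and the third is hard at both. Step (d), the content review of the flagged items, remains a matter of expert judgment and is not something code of this kind can replace.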


Conclusions and Recommendations 


While subscores are highly sought after by test users and are reported operationally for several
large-scale assessments, not all subscores that are reported are of adequate psychometric quality.
Based on the existing research, our recommendations on communicating subscores are provided
in the following list. We do not focus (in the following list and elsewhere in this chapter)
on how subscores should be communicated; such discussions can be found in several other
chapters in this volume and in Goodman and Hambleton (2004) and Roberts and Gierl (2010).
Instead, our recommendations mostly focus on when to communicate subscores.


1. Use of content blueprints as a basis for subscores does not guarantee that the resulting
subscores will be highly distinct. For example, Sinharay, Haberman, and Puhan (2007) showed
that subscores based on highly distinct content categories such as math, reading, and writing
still produced scores that were highly correlated with each other. Subscores based on
psychometric approaches might be more useful. For example, Wainer, Sheehan, and Wang (2000)
proposed a tree-based regression analysis where items are clustered in a way that minimizes
within-cluster variation while simultaneously maximizing between-cluster variation. However,
psychometric-based approaches may lead to subscores that would be difficult to interpret.
They may also not be reproducible when applied to other data. Techniques such as
evidence-centered design (e.g., Mislevy, Steinberg, & Almond, 2003) or assessment engineering
practices for item and test design (e.g., Luecht, Gierl, Tan, & Huff, 2006) should be used to
ensure that the subscores have satisfactory psychometric properties.

2. Subscores that are reported should be of adequate psychometric quality. In other words,
there should be evidence of adequate reliability, validity, and distinctiveness of the reported
subscores. Any reported subscore, in order to be reliable, should be based on a sufficient
number of carefully constructed items. Although there are no clear guidelines in the
psychometric literature on how many items are needed to provide a reliable subscore, some
research has shown that at least 20 multiple-choice items are typically needed in each subtest
for the corresponding subscore to yield a moderately high reliability (e.g., Puhan, Sinharay,
Haberman, & Larkin, 2010; Sinharay, 2010). Combining some subscores can increase subdomain
test length and therefore might result in subscores that have higher reliability and hence
added value. For example, subscores for “Physics: theory” and “Physics: applications” may be
combined to yield one subscore for “Physics.” Combining subscores to meet reliability
considerations, however, can create interpretation problems. It is also important to ensure
that the skills of interest are as distinct as possible from each other, though this is quite
a difficult task to accomplish before seeing the data.

3. One can consider reporting weighted averages or augmented subscores, which often have
added value (e.g., Sinharay, 2010) and often provide more accurate diagnostic information
than the subscores do. Weighted averages may be difficult to explain to the general
public, who may not like the idea that, for example, a reported reading subscore is
based not only on the observed reading subscore but also on the observed writing subscore.
However, several approaches to the issue of explaining such weighted averages can
be considered. One is that the weighted average better estimates examinee proficiency in
the content domain represented by the subscore than does the subscore itself. This result
can be discussed in terms of prediction of performance on an alternative test. Sinharay,
Haberman, and Wainer (2011) demonstrated using data from an operational test that
the correlation between a subscore and the corresponding subscore on a parallel form is
smaller than the correlation between the corresponding weighted average on the original
form and the corresponding subscore on a parallel form; this finding is supported by
theoretical results of Sinharay (2018). The issue can also be discussed in terms of common
cases in which information is customarily combined. For example, premiums for
automobile insurance reflect not just the driving experience of the policy holder but also
related information (such as education and marital status) that predicts future driving
performance. In most cases, this difficulty in explaining the weighted averages is more
than compensated for by the higher reliability of the weighted average.

4. Subscores should be reported on an established scale and also equated so that the
definition of strong or weak performance in a subject area does not change across different
administrations of a test. Note that if the subscores are based on only a few items, equating
may not be of satisfactory psychometric quality for the subscores.³ For example, if common
items are used to equate the total test, only a few of the items will correspond to a
particular subarea, so that common-item equating (e.g., Kolen & Brennan, 2014) of the
corresponding subscore is not feasible. Some research on equating of subscores and weighted
averages has been done by Puhan and Liang (2011) and Sinharay and Haberman (2011).

5. Although we primarily discussed subscores for individual examinees, subscores can be
reported at an aggregate level as well (e.g., institution or district levels). Longford (1990)
and Haberman, Sinharay, and Puhan (2009) have suggested methods to examine whether
aggregate-level subscores are of added value, and presented examples of situations in which
aggregate-level subscores do not have added value. It is therefore important to examine the
quality of aggregate-level subscores before reporting them.

6. Finally, this chapter primarily focused on large-scale assessments. Although other types
of assessments, such as formative assessments, are beyond the scope of this chapter, one
can apply some of the techniques and procedures discussed in this chapter to evaluate the
utility of subscores in formative assessments. Just as subscore information in large-scale
assessments can aid in training and remediation, subscores in formative assessments can
help teachers plan instruction that takes into account the strengths and weaknesses of
their students. For this purpose, they will need subscores that are reliable and distinct
from the other subscores. Providing those subscores might be even more difficult in
formative assessment, where it is impractical to administer long tests on an ongoing basis
throughout the year.
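
The link between subtest length and reliability noted in recommendation 2 can be illustrated with the classical Spearman-Brown formula. This is a textbook approximation that assumes the added items are parallel to the existing ones; it is offered here only as a rough sketch and is not the procedure used in the studies cited above. The starting reliability of 0.60 is a hypothetical value.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`
    using items parallel to the existing ones (Spearman-Brown formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


def length_factor_needed(reliability, target):
    """Factor by which a test must be lengthened to reach `target` reliability."""
    return (target * (1 - reliability)) / (reliability * (1 - target))


# Hypothetical 10-item subscore with reliability 0.60.
print(spearman_brown(0.60, 2))            # merged with a parallel 10-item subscore
print(length_factor_needed(0.60, 0.80))   # lengthening needed to reach 0.80
```

Under these assumptions, doubling the 10-item subtest to 20 items raises the projected reliability from 0.60 to 0.75, and reaching 0.80 would require roughly 2.7 times the original length, or about 27 items.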


Notes 


1 The cutoff of 0.80 seems reasonable from a discussion in Nunnally (1978, p. 245).

2 Wainer et al. did not clearly mention what cutoff they used; however, given that the reliabilities of the four section
scores are 0.52, 0.60, 0.77, and 0.85, it seems that they most likely used the criterion of 0.80 that is often attributed to
Nunnally (1978).

3 That is another reason for not reporting subscores based on a few items.


Reporting Student Growth
Challenges and Opportunities


April L. Zenisky, Lisa A. Keller, and Yooyoung Park 


The aim of score reporting in the context of K-12 educational testing is to provide stakeholders
ranging from students and families to teachers, schools, districts, and states, as well as the
general public, with information about student performance. The individual student score report
documents that have served as the primary communication vehicle for such results follow
something of a typical script across states, testing companies, and reporting vendors in terms
of both content and format: High prominence is usually afforded to reporting of a scale score
(often represented numerically and graphically) and a proficiency level classification, followed
by or integrated with some normative results comparing a student's score to that of defined and
relevant reference groups (e.g., school, district, state), and typically, there is also some kind
of subscale reporting, often using a number-correct metric. Individual state reports for individual
students, of course, do vary from one another with the addition of elements such as results for
released test items, narratives of performance strengths and weaknesses, and individualized
guidance for next steps, as well as differences in structure and layout.

One element that is increasingly common on student test score reports is displays of student
growth. Such displays, when included on score reports, draw on the test score history for each
individual student and (sometimes) a statistical cohort group to compute and contextualize
patterns of scores over time and, often, to project future performance. Numerous strategies
and methods have been developed and deployed operationally to compute indices
of growth at both the individual and group (classroom, school, district, etc.) level. O'Malley,
Murphy, McClarty, Murphy, and McBride (2011), in their overview of student growth models,
characterized approaches to calculating student growth as three general types:


• Growth to Proficiency models: Student performance is compared to a yearly growth target,
in order to reach a defined “proficiency” within a set number of years.

• Value/Transition Tables: Growth is considered relative to change in performance category
assignment over years (e.g., movement from “Needs Improvement” to “Proficient”).

• Projection models: Student performance is predicted using past and current student
performance and the performance of prior cohorts in the target grades.

Each one of these approaches comes part and parcel with key statistical assumptions that
guide appropriate interpretation and use, and these assumptions take on critical importance
when these results are included on reports distributed to families, alongside other indicators of
student performance.

The present chapter begins with an acknowledgement that student growth results are at present
being included on reports for a wide range of stakeholders and are being used for purposes
ranging from informational to high-stakes. We first provide a very brief review of student
growth approaches and models drawn from the psychometric literature, to set the stage for the
practice of reporting growth operationally. While our purpose is not to litigate methodology or
the underlying statistical assumptions of growth models, their use in various educational policy
decisions has at times been challenged because of conceptual and computational complexity. This
necessarily connects to questions pertaining to reporting, including the extent to which such
results may be understood and ultimately used correctly by the various intended audience(s).
From there, the chapter focuses on current practices in reporting growth, discussing strategies
used and the implications for interpretation and use associated with different display approaches.
The next portion of the chapter reports the results of a small-scale survey study carried out to
investigate the extent to which various displays of student growth results can be understood and
interpreted correctly. The last section of this chapter aims to apply the broader body of research
on score reporting to the specific topic of growth reporting, synthesizing best practices in
reporting and the results of the study presented here to evaluate the communication of growth
results to consumers of educational test data, and ultimately to lay out a research agenda in
this area.


A Brief Overview of Growth


The term “growth” has entered the current educational vernacular over the course of the
past 10 to 15 years, albeit without much in the way of a standard or formalized meaning. At
present, a variety of growth models have been developed to aid in the interpretation of scores from
high-stakes assessments beyond simple status measures, and all of these models define growth
in slightly different ways. Often, the various models do not make explicit what type of “growth”
is being measured in the model; further, it is not uncommon that the statistically implemented
definitions of growth do not conform to the common conceptualizations of growth, and using
the term may at times be confusing or misleading to lay audiences (Keller, Colvin, & Garcia,
2016). Growth models are used for a variety of purposes and at many levels (for example, at the
level of the individual student as well as aggregated to the school, district, or state).

The use of growth measurements in large-scale testing stemmed from a concern that status
measures were insufficient to capture the work that schools were doing to advance students.
With No Child Left Behind, status measures were primarily used to monitor school progress,
through measures such as “percent proficient.” Schools were expected to increase the percent
of students proficient on statewide tests at various grade levels in various subjects from year to
year. Such changes (and really, any observed improvements) in the number of students deemed
proficient were indicators of progress of schools. In using such a system, however, there was no
credit given to students who made progress but did not change from not proficient to proficient.
This issue was seen as especially problematic for lower-performing schools that would have
difficulty increasing the percent of students in the proficient category, but that were, nonetheless,
improving the achievement of students. As a result, the family of approaches now known as
growth measures was designed to give credit to schools that were making progress, and also to
provide students and their families with more information about their progress.

While the idea of measuring academic growth as change is quite appealing, the reality of
measuring academic growth is much more complex. Measuring physical gains, like height, for
example, is relatively easy, whereas measuring changes in academic performance is less so, for
several reasons. First, the definition of the construct being measured is more complex. The
construct of height is well-understood, clearly defined, and easy to compute. Changes in the
construct are also easy to determine given that a measurement at time 1 can be easily compared
to a measurement at time 2. For example, a child who is 36 inches at time 1 and 42 inches at time
2 grew 6 inches, and an inch is a readily relatable unit for measuring human height. It might not
be clear whether or not that is a lot for a child to grow in that time span, and other information
would be required to provide context for that change in height. Some contextual factors might
be the age of the child, the sex of the child, and the ethnicity of the child.

Now consider a different kind of construct to be measured, such as math achievement. In this
context, the knowledge and skill of interest cannot typically be directly assessed (so a test
instrument provides a proxy for that measurement), and thus it may be less clear what exactly
we are measuring (although it does not seem controversial that our statewide tests are
measuring academic knowledge). However, how to conceptualize that change is not as simple. The
most intuitive way to conceptualize growth is to do as we do with the height example: Give a
test at the beginning of the year and at the end of the year, and see how many more questions
the student got right. Now suppose the student got 10 items right (out of 50) in the fall and 40
items right in the spring. The difference in performance is 30 items; that is, the student was
able to answer 30 more items correctly in the spring data collection than in the fall. This change
could then be contextualized, as is done with height, to determine if that is a large change or
not. However, a student cannot be administered the exact same test form in the fall and spring,
as they might remember the questions, and further, the amount of testing time required to do
this across subjects might be extraordinary. (As an aside, there are well-known underlying
statistical issues with using these simple gain scores (e.g., Cohen, Cohen, West, & Aiken, 2003).)
Therefore, there is a desire to use existing tests, designed to measure status, to also measure
the academic growth of students. Some ways growth could be, and currently is, conceptualized
include:


• A higher score on the test compared to last time you took the test

• A change in proficiency category

• A greater mastery of the content

• Doing better than the other students in the class.
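
Three of the conceptualizations listed above can be computed directly from scores, as the following rough sketch shows for the fall and spring example (10 and 40 items correct out of 50). The cut scores, the third category label, and the classmates' gains are invented for illustration; "greater mastery of the content" is omitted because it depends on how the curriculum is mapped to items.

```python
from bisect import bisect_right

# Hypothetical cut scores on a 50-item test and their category labels.
CUTS = [25, 45]
LEVELS = ["Needs Improvement", "Proficient", "Advanced"]


def gain(fall, spring):
    """Simple gain score: spring minus fall."""
    return spring - fall


def level(score):
    """Proficiency category for a number-correct score."""
    return LEVELS[bisect_right(CUTS, score)]


def share_with_smaller_gain(student_gain, class_gains):
    """Proportion of classmates whose gain was smaller than the student's."""
    return sum(g < student_gain for g in class_gains) / len(class_gains)


fall, spring = 10, 40
student_gain = gain(fall, spring)
print(student_gain)                                            # higher score than last time
print(level(fall), "->", level(spring))                        # change in category
print(share_with_smaller_gain(student_gain, [5, 10, 12, 35]))  # relative to classmates
```

With these made-up values the same student shows a 30-item gain, a move from "Needs Improvement" to "Proficient," and a larger gain than three of four classmates. The three summaries need not agree in general, which is one reason the intended meaning of "growth" matters for reporting.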


Many resources describe the variety of growth models that are available and in use
(e.g., Castellano & Ho, 2013), so they are not reiterated here.

One important defining characteristic of growth measures is whether the
measure is norm-referenced or criterion-referenced. Criterion-referenced measures
look at gains relative to the curriculum, such as how much more of the curriculum
the student has mastered at “time two” as compared to what was mastered at “time one.” These are
the kinds of growth measures we might intuitively think about when we think about improvement.
Norm-referenced measures, on the other hand, consider questions like “How did my
performance, or change in performance, compare to my peers?” The definition of “peers” might
vary across measures but can be all students in the same grade, all students with a
similar score history, or all students with similar demographics.


Issues in Reporting Growth


To place growth reporting in the larger context of results reporting, it is important to note
that while individual score reports have been furnished to students and families for as long as
student testing has been taking place, there has been considerable evolution in the appearance,
contents, format, and distribution of score reports, and these changes have been particularly
pronounced in the past 20 years or so. One especially notable publication that was evidence of
such change and consequently further raised awareness of the importance of thoughtful and
deliberate score reporting is the 1998 publication of the National Educational Goals Panel (NEGP),
Talking About State Tests: An Idea Book for State Leaders. While framed by educational goals that
were set for the year 2000, this document is remarkable in establishing several clear principles
for reporting that remain highly relevant to reporting practices today, divided among strategic
and content recommendations. The NEGP suggested that states answer the following four
questions through the reports distributed to families (NEGP, 1998, p. xi):


1. How did my child do?

2. What types of skills or knowledge does my child's performance reflect?

3. How did my child perform in comparison to other students in the school, district, state,
and, if comparable data are available, the nation?

4. What can I do to help one of these children improve?


While these would seem to be obvious today, the NEGP's Idea Book helped to formalize
these questions as guiding principles for report development, and ultimately helped launch a
sea change in report development practices. With the 2002 passage and implementation of the
No Child Left Behind Act and its subsequent reauthorizations, student testing was elevated to
a place of significant public prominence in the US educational landscape, and consequently
research attention on communicating test scores has increased dramatically. Rather than an
uninteresting duty for testing agencies that is left to the end of test development, reporting today
is increasingly viewed as a critical element of communication about what students know and
can do, and agencies have increasingly devoted resources to advancing good test score reporting
to facilitate action on the basis of the information presented. This has led to the development of
several models for reporting (Zenisky & Hambleton, 2012; Zenisky & Hambleton, 2016, etc.)
as well as a number of empirical studies about specific reporting elements, including work by
Zwick, Zapata-Rivera, and Hegarty (2014) and Zapata-Rivera, Kannan, and Zwick (this volume)
on error, and consideration of the interaction of report contents and specific stakeholder
groups (e.g., Goodman & Hambleton, 2004; Rick et al., 2016).

As noted at the outset of this chapter, typical test score reports for students contain a number
of basic or common pieces of information. In recent years, many state educational agencies in
the United States have chosen to incorporate displays of results pertaining to
growth on those reporting documents, in addition to the more familiar reporting elements that
are generally present. This shift to include growth results on reports has been driven in part by
policy decisions at the level of federal and state agencies and the availability of these results, but
unlike other indicators of academic performance, growth results seem not yet to have found
their way into many studies that evaluate the use and understanding of specific score report
displays. In some typical approaches, growth is presented on individual score reports in the
form of line graphs (e.g., Colorado, www.cde.state.co.us/accountability/understanding-growth-reports),
but other displays in use include bar charts (Georgia,
www.gadoe.org/Curriculum-Instruction-and-Assessment/Assessment/Documents/GSGM/GSGM_EOG_SampleReport_16.pdf)
and tables/text (http://understandthescore.org/score-report-guide/).

As with other elements of student score reports, report developers are faced with many
content and design choices when formatting growth display results for inclusion on student
score reports. In terms of communicating results for measures of student growth to educational
stakeholders, regardless of the model or strategy used, there are several key considerations
that should affect how these elements of reports are implemented. Chief among
these considerations is model choice, because the nature of the statistical information to be
communicated is a driving factor in what can and should be shared on a score report for families,
educators, and/or educational administrators. Related to this is an accounting of the statistical
assumptions of each model, which likewise define what is and is not possible from a reporting
standpoint, in particular relative to interpretation and use. Next, reporting for growth model
results will vary considerably based on the appropriateness/relevance for individual reports and
group reports, and the landscape of approaches (text, graphical displays, and/or tables)
encompasses a range of strategies. The final consideration in reporting such results is the issue of
score error for the growth model calculations. The standard error of measurement (SEM) for scale
scores appears only sporadically on student score reports at present, and this pattern seems to
be continuing in the context of reporting growth results. The SEM on reports is
already problematic for users (in general, users do not like the inclusion of the SEM, and often it
is not understood), and growth scores are even less reliable than test scores, so the need for
error bands is even greater than with single scores.
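
The claim that growth scores are less reliable than the test scores they are built from can be illustrated for the simplest case, a gain score, using the classical test theory formula for the reliability of a difference. The sketch below assumes uncorrelated measurement errors across the two occasions, and the reliabilities, correlation, scores, and SEMs are hypothetical; other growth measures, such as growth percentiles, have their own error properties that this calculation does not capture.

```python
import math


def difference_reliability(rel_x, rel_y, r_xy, sd_x=1.0, sd_y=1.0):
    """Classical reliability of the difference Y - X, given each score's
    reliability, the correlation between X and Y, and their standard deviations."""
    cov_term = 2 * r_xy * sd_x * sd_y
    true_var = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - cov_term
    observed_var = sd_x ** 2 + sd_y ** 2 - cov_term
    return true_var / observed_var


def gain_error_band(score_1, score_2, sem_1, sem_2, z=1.0):
    """Gain plus or minus z standard errors, assuming independent errors."""
    gain = score_2 - score_1
    sem_gain = math.sqrt(sem_1 ** 2 + sem_2 ** 2)
    return gain - z * sem_gain, gain + z * sem_gain


# Two tests that each have reliability 0.90 and correlate 0.80 with each other.
print(difference_reliability(0.90, 0.90, 0.80))
# Scale scores of 500 and 520 with SEMs of 3 and 4 points.
print(gain_error_band(500, 520, sem_1=3, sem_2=4))
```

In this example two highly reliable tests (0.90 each) yield a gain score with reliability of only about 0.50, and a 20-point gain carries a one-standard-error band of roughly 15 to 25 points, wider than the band around either score alone.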

Part of the challenge of reporting growth is consistent with findings of other studies of
reporting elements, in that some report elements pose particular difficulties in terms of data
interpretation and use. Research has shown that misinterpretations are common when users are
asked direct knowledge questions about various score report elements. To our knowledge, there
have been no published studies of reporting that have focused explicitly on growth reporting
displays, though considerable efforts have been made by state education agencies to develop
interpretive guides (across text, presentations, and video formats) to explain reports. Little is
known about best practices for reporting growth, or about what elements of growth displays lend
themselves to correct interpretations or misinterpretations by intended users. It should be noted
that growth displays are being included in individual score reports sent to families as well as
made available in dashboard-type interactive formats for teachers, school administrators, and
district and state personnel, and the training and expectations for use in all of those scenarios
are quite different.

To return to the guiding principles espoused by the questions posed by the NEGP in the 1998
publication, the inclusion of growth reports as a display element in individual score reports for
students needs to be evaluated for purpose. What do displays of growth mean, how do intended
users understand them, and how does inclusion of those results on reports help intended users
to move forward on the basis of that information? Ultimately, as with all aspects of student
reports, when included, growth results should accomplish a specific informational purpose, and
evidence needs to be gathered to show that such results are understood and used as intended. As
noted previously, there is not a lot of research surrounding best practices for reporting growth
measures. Given the complexity of some of the growth measures in use, the means for
communicating that information is crucial for proper understanding and use of the information.
Research in this area will help practitioners learn the best ways to communicate this complex
information appropriately. In the next section, a small-scale study to begin to
understand reporting practices in this area is described.


A Small-scale Study 


To obtain information about how interpretable score reports are, a small study was conducted
online using Amazon Mechanical Turk (MTurk). Six score report displays that reported student
growth were selected from publicly available score reports. For each score report, items to
evaluate the interpretability of the display were created, and a random sample of two displays
and their associated statements (for agreement/disagreement) was presented to each research
participant.

Respondents were required to be at least 18 years old and live in the United States. A total of
220 adult respondents living in the United States participated in the study. The median age
grouping of the participants was between 25 and 34, which comprised about 45% of all participants
in the study. Most of the respondents reported at least some college education, if not a degree:
44% of the respondents indicated that they held a bachelor's degree, about 26% of the respondents
indicated they had education at the college level without a degree, and 13.9% of the respondents
held an associate's degree from a two-year college. A majority of the respondents (57.5%)
indicated having taken some course in statistics, and in terms of the highest level of statistics
courses taken, about half (49.5%) of the participants indicated their undergraduate statistics
courses were their highest level of statistics courses, followed by 24% of the participants with
no statistics course experience, and followed next by almost 20% with statistics courses in high
school. With regard to sex, 50% of the total respondents were male and 50% were female. By
ethnicity, just over 81% of the respondents were white, nearly 12% were native Hawaiian or Pacific
Islander, about 8% were Asian, and 1% were African American. Finally, 5.9% of the respondents
also reported a Spanish, Hispanic, or Latino background.

Six different displays of score reports that were publicly available were chosen for the study.
All the displays used growth percentiles as their growth measure, making it possible to focus
on evaluating differences in the interpretability of the displays rather than differences in
metric. Student Growth Percentiles (SGPs; Betebenner, 2009) are one of the more popular
measures of growth, as SGPs can be used for any test with multiple administrations, regardless
of the type of scale that is used. Statistically, the SGP is a complex model built on quantile
regression, so users may misinterpret the meaning of the growth percentile unless guidance on
how to interpret the value is provided. The statistical details of the model can be found in
Betebenner (2009). As noted by Goldschmidt et al. (2012), the SGP does not measure absolute
growth in performance but provides a normative context for comparing the test scores of
students with similar score histories. Percentile ranks are assigned to students. For example,
a student with an SGP of 70 performed better than 70% of his or her peers who had similar score
histories. As such, many students can get an SGP of 70, but it does not mean that they have
shown the same changes in performance.
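
The normative idea behind a growth percentile can be illustrated with a small sketch. This is
not the quantile regression model described in Betebenner (2009); it is a simplified stand-in
that assumes "similar score history" means a prior-year score within a fixed band. The function
name, the band width, and all scores below are hypothetical.

```python
from bisect import bisect_left


def simple_growth_percentile(prior, current, peers, band=5.0):
    """Percent of similar-history peers whose current score is below `current`.

    peers is a list of (prior_score, current_score) pairs. "Similar history"
    is approximated here (an assumption, not the SGP model) as a prior score
    within `band` points of the student's prior score.
    """
    similar = sorted(c for p, c in peers if abs(p - prior) <= band)
    if not similar:
        raise ValueError("no peers with a similar score history")
    below = bisect_left(similar, current)  # count of peers scoring strictly lower
    return round(100 * below / len(similar))


# Hypothetical data: ten peers with a prior score of 200 and ten with a prior score of 300.
peers = [(200, 200 + k) for k in range(1, 11)] + [(300, 290 + k) for k in range(1, 11)]

student_a = simple_growth_percentile(prior=200, current=208, peers=peers)  # gained 8 points
student_b = simple_growth_percentile(prior=300, current=298, peers=peers)  # lost 2 points
print(student_a, student_b)
```

In this invented data set both students receive a growth percentile of 70, even though one
gained 8 scale-score points and the other lost 2, which is the distinction between normative
and absolute growth drawn above.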

Since the goal of the study was not to critique specific states' displays but to evaluate the
broad interpretability of common growth reporting approaches, images of the actual displays
used are not provided. However, a mock-up of the various types of displays is provided in
Figure 4.1 to give context for the results of the research. A description of the specific
displays also follows Figure 4.1 and refers to this figure for the various display types.

Each participant was presented with two of the score displays, randomly selected from
among the six displays prepared. Six statements were presented for each display, and survey
participants were instructed to click on the one or more statements that they decided were
true, given the information presented in the display. (At least one statement presented for each
display was true.)
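
The assignment step can be sketched as follows. The chapter does not say how the randomization
was implemented, so this assumes simple random sampling of two distinct displays per
participant with no balancing across displays. The labels "Projection A" and "Projection B"
and the seed are hypothetical.

```python
import random

# Display types from the chapter; the A/B suffixes are invented labels.
DISPLAYS = ["Table", "Line", "Text", "Bar", "Projection A", "Projection B"]


def assign_displays(rng, displays=DISPLAYS, k=2):
    """Pick k distinct displays at random for one participant."""
    return rng.sample(displays, k)


rng = random.Random(2024)  # fixed seed only so the sketch is repeatable
assignments = [assign_displays(rng) for _ in range(220)]  # 220 respondents, as in the study
```

Under this scheme each display is seen by about 73 participants on average (220 x 2 / 6), but
the counts per display will not be exactly equal.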

These display-specific statements were developed to be as parallel as possible across the
displays, rather than being unique to each display. However, since some displays provided
information that others did not, it was not possible to have completely parallel statements
across displays. To this end, a framework was developed and used for constructing the
specific statements across displays. The five information categories described below in
Table 4.1, along with sample statements for each category, informed the formation of the six
statements for each display.

Although performance indicator results were not essential to this study, statements about
the performance of the students were included to determine whether the respondents could
broadly interpret the display as a starting point, as performance indicators are generally more
familiar and straightforward test results for many prospective audiences. These statements are
not summarized in the results section but are provided here for context. In part, they were
included so that statements covered both more and less familiar reporting metrics, lest
participants become frustrated by responding continuously to growth statements (which concern
generally less familiar displays).


Figure 4.1 Mock-ups of displays used (clockwise from top: Table, Line, Text, Bar, Projection).


Table 4.1 Information categories of statements with examples

Information Category                       Example Statement

Interpretation of performance              The student's performance was in the “Met Expectations” category.
Comparison of performance levels/scores    The student's performance level improved from Grade 4 to Grade 5.
Identifying the growth measure             The student's growth percentile in Grade 6 was 26.
Interpreting the growth measure            In 2017, the student's growth was the same or better than 52% of
                                           other grade 8 students.
Comparison of growth measures              The student's growth score increased from 2015 to 2016.


The first display was a table-type display (Figure 4.1) that provided information only about
SGPs and not about the student's performance. This display presented the growth of a student
as well as the growth of the school and the district to which the student belonged, lending
itself to comparison of the individual student's growth with the average growth of his or her
school and district, formatted as a table.

The second display was a line graph type display (Figure 4.1) that presented a student's
performance history over the past three years, using performance level categories and growth
percentile scores for each year. The growth percentiles for each of the three years were
presented in a line graph, and the interpretation of the current year's growth percentile was
provided in text next to the graph, with the SGP in bold print.

The third display included a bar graph display of performance, but the only growth-related
information was a text-based statement at the bottom of the score report, in the form of a
single sentence (Figure 4.1). In general, this display focused primarily on the performance
level of the student, both individually and within the context of others at the school,
district, state, and cross-state levels, with growth information appearing relatively minimized.

The fourth display was a bar graph type display that presented a student's trajectory of
achievement from the past two years to the current year in terms of scale scores and
performance levels (Figure 4.1). The performance level of the student for each of the three
years was presented both numerically and in a bar graph. Between the current year and the
previous year, the SGP was provided, along with the categorization of that growth as High, Low,
or Typical.

The fifth display was a projection type display that showed changes in a student's achievement
levels as a line graph, simultaneously using scale scores and growth percentiles. The values
for the scale scores and the SGPs were presented in tabular form below the graph, and no values
were provided on the graph itself. The category of growth (High, Low, Typical) was provided in
the table and, through the use of color, in the graph. Moreover, this display provided a
projection of the student's future performance by indicating the levels of growth relative to
the performance levels that might be obtained in the next year.

The sixth display was also a projection type display and was very similar to the fifth display,
presenting the trajectory of the performance of an individual student as well as the projection
of his or her future performance. However, in this presentation, the Y-axis provided values for
the scale score, and the scores were also presented in tabular form.

For each display, the number of statements of each type is provided in Table 4.2. The number
of statements of each type was dictated by the type of information provided by the display.
So, while it was desirable to have statements of all types for each display, some displays
simply did not have the information to support those types of statements. This table should
provide some insight into the differences in the type of information provided in each display.


Table 4.2 Number of questions of each type for each display.

Display          Interpret     Compare       Identify   Interpret   Compare
                 Performance   Performance   Growth     Growth      Growth

1: Table         0             0             0          1           5
2: Line          1             1             1          1           2
3: Text          2             0             0          4           0
4: Bar           1             1             0          3           1
5: Projection    1             1             1          1           2
6: Projection    1             1             1          1           2


Results


Each of the six displays was presented to a group of adults online, who were asked to respond
to factual statements about the display (by agreeing or disagreeing). As noted previously,
every attempt was made to present statements that were as parallel as possible across the six
displays, although there was some variation due to the nature of the information presented in
the respective displays. Statements were categorized in terms of cognitive level (identify,
interpret, compare) and content (performance or growth). The table-type display had only
information related to growth and none related to performance. Although the purpose of this
study was to learn about best practices in reporting growth, some simple statements about
general performance were included, since this type of information is typically more
straightforward and easy to interpret. Including these statements gave us some evidence as to
whether the respondents could make basic interpretations of the displays, and allowed us to
present statements on all aspects of the score report. This led to the five categories of
items provided above. For each of these categories, the percent correct was computed for each
display. Results are presented only for the statements related to growth, as those are the
statements relevant to this study.

The statements related to growth were combined into the three categories of Identifying
Growth, Interpreting Growth, and Comparing Growth; results were computed for each category
rather than for each individual statement. For the Identifying Growth category, there were data
for only two types of displays: line and projection. For the line type display, the percent
correct for this category was 78%, and for the projection type it was 87% (both projection
displays had exactly 87%). This difference was not statistically significant. For the
Interpreting Growth category, however, there were greater differences. The line, text, and
bar-type displays produced results that were significantly higher in accuracy than the
projection displays (p < 0.05). Since the projection type had two versions, it is interesting
to note that there was considerable variability within that display type: for one
projection-type display the accuracy was 68%, and for the other it was 52%. Given the
complexity of interpreting the SGP, this is a very important finding: if the information was
presented appropriately, the user was able to understand it, but how that information was
presented really did matter. Presentation here was a matter not just of the type of graphic
used, but also of the way the information was displayed on the graphic.

In the category of Comparing Growth measures, respondents were asked either to compare
different values of the SGP for a student across years or to compare the student's SGP with
the school/district/state SGP. In either case, the respondent was asked to compare two or more
values of the SGP. The text type display did not contain information relevant to this
category. Again, there was a lot of variability among the displays with respect to how
difficult the respondents found this task. On average, the projection type displays were most
successful in presenting this type of information, with the highest accuracy (93%), as
compared to the bar display (69%) and the line display (52%). The differences between the
projection type displays and the other two types were statistically significant (p < 0.05).
Within the projection type displays there was some variability, although both displays had
high rates of accuracy, 89% and 97%.
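
The chapter does not state which significance test was used or how many respondents saw each
display. One common way to compare two percent-correct values is a pooled two-proportion
z-test, sketched below. The counts are hypothetical, chosen only to reproduce the reported 93%
and 69%, and the test treats responses as independent, which ignores the fact that each
respondent answered several statements.

```python
from math import erfc, sqrt


def two_proportion_z(correct_a, n_a, correct_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    return z, erfc(abs(z) / sqrt(2))  # erfc gives the two-sided normal tail area


# Hypothetical counts: 93 of 100 correct vs. 69 of 100 correct.
z, p = two_proportion_z(93, 100, 69, 100)
print(f"z = {z:.2f}, significant at 0.05: {p < 0.05}")
```

With smaller groups than the 100 assumed here, the same percentages would yield a smaller z,
so the actual group sizes matter for any claim of significance.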

In summary, the short study indicated that there were differences in the respondents' skill
in interpreting statements about several displays of growth results. Since participants in the
study were adults from the general population rather than stakeholders, the results might be a
lower bound on those that would be obtained from respondents with a more acute interest in
understanding the displays. Nonetheless, the study provided valuable information regarding how
easily various displays of student growth could be interpreted. As a general result, there was
greater variability across the displays when the statements required a higher order of
thinking (Interpret or Compare). Statements that required no real interpretation, such as
identifying the performance level or growth level of a student, were typically easy for all
respondents, regardless of the display used. In contrast, when respondents were asked to make
interpretations or compare measures, there was greater variability across the displays,
indicating that some displays were more successful than others at communicating complex
information.


Impact of Demographic Groups 


When the results were compared across demographic groups, there were no statistically
significant differences in the percent correct. Specifically, there were no statistically
significant differences depending on experience with statistics, gender, ethnicity, age, or
level of education. This lack of difference might be due in part to the small sample sizes in
the groups. When examined descriptively, some differences were noted, and these lend some
credibility to the results obtained. For most of the categories of statements, no interesting
differences were noted across groups; for the Interpreting Growth category, however, some
differences began to emerge. With respect to level of education, accuracy was higher for those
with a professional degree (e.g., MD or JD) than for those with less education: approximately
90% for those with a professional degree, and closer to 70% for the other levels of education,
with some categories slightly lower. Similarly, those who took a statistics course at the
graduate level were more accurate (approximately 87%) than those who took no statistics
courses or took statistics at the high school or undergraduate level, where the accuracy was
approximately 67%.


Four Key Issues 


As noted previously, growth model results reporting is something of a new frontier in reporting
practices in K-12 testing, as these types of results have only started to become more widespread
on reports in recent years, though the increase in use has been exponential. In reflecting on
the displays available on various state education agency and testing company websites (which,
though a non-scientific sample, offer some insight into state and local reporting practices),
and considering the results of this small-scale study, it is clear that from a communication and
reporting perspective, some important issues and considerations are emerging.


1. Complexity of Results 


It is clear from the displays we have found that, while the computation of many types of growth
scores involves a series of statistical models and choices behind the scenes, and the
mathematics of these approaches may not be readily accessible to many stakeholders (such as
families and educators), it is possible to report the meaning of these scores in a way that can
be understood by a lay audience. Not all of the approaches were successful in communicating the
meaning of the SGP or how to interpret it, and as such, care must be taken in how the data are
presented. Some specific features that are associated with greater understanding of the
displays include:


• Clear definition of who is in the norm group for the SGP: In cases where that information
was clearly presented, the respondents could correctly identify that the SGP reflected
performance relative to some other group of students, not students in general.

• Prominence in layout and design: If this information is provided in text, the text should
be in a large enough font size to draw attention to it and should be located centrally so
that it is not ignored.

• Line graphs: In comparing the growth measures, the displays that utilized line graphs
appeared to work better than other types of graphics, such as bar charts.


It is important to recognize that growth scores are a different kind of score for many intended
users of the data, and as such present a communication challenge for testing agencies in
informing users about the meaning of growth scores as well as appropriate interpretations and
uses. However, principles and insights gained from research on reporting scaled scores or
performance categories are relevant. No information was gathered as to how respondents would
take action based on these growth scores, and as such, it is difficult to conclude how deeply
they understood the information. However, the data collected here indicate a different trend
from that in previous studies (Clauser, Keller, & McDermott, 2016), in which high school
principals were not able to correctly interpret the SGP. This trend is an encouraging sign that
improvements are being made in how this information is communicated to the public.


2. Need for Interpretive Materials 


Interpretative materials were not presented to the respondents in the small-scale study
presented here, but they clearly would aid in the interpretation of growth measures. In
reviewing and surveying strategies for growth reporting, it was evident that agencies
responsible for reporting are at least in part aware of the potential for these reporting
efforts to be misinterpreted. Two states, Hawaii and Virginia, have made
explanatory/interpretive videos available on their state education websites that explicitly
focus on providing stakeholders with details on growth models. These videos use features such
as cartoon figures and analogies (bus, high jump) to make the concepts accessible to viewers.
This is an innovative approach to reporting in general (and growth reporting specifically) that
should be commended. Other states have released annotated report displays, illustrative guides,
and PowerPoint presentations to provide further details on the mechanics and appropriate uses
of these data. These approaches are innovative strategies, and we look forward to seeing
further findings about report clarity and usefulness for users. Future research could, and
should, focus on how these materials are used by stakeholders and how their use affects the
interpretation of score reports. These materials would be highly useful so long as stakeholders
use them and apply the information correctly. Based on the study conducted here, interpretative
materials should also provide guidance on how to compare different growth scores. These
comparisons might be from subject to subject, from student to school/district/state, or across
years. These types of statements appeared to be especially problematic for the respondents in
the study.


3. Error Reporting 


Reporting error in test scores is an element of score reporting that has been implemented to
varying degrees in different testing contexts, even for the relatively straightforward scale
scores typically reported. It is also a topic receiving increased research attention
(Zapata-Rivera, Kannan, and Zwick, this volume). Some states provide standard error information
in their current K-12 reporting efforts for scale scores, and others do not. Where information
about score imprecision is not included, the reasoning behind that decision may be rooted in
concerns about adding complexity to reports and/or a belief that the presence of such
information implies that a test is unreliable because error is associated with it. However, the
Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014) suggest that
providing some information about the errors associated with test scores is necessary and
responsible, and this guidance, in our opinion, should be extended to the statistical
calculations of growth, which themselves are extrapolated from test scores. None of the
displays used in this study contained any reference to errors in the growth scores or to how
stable the growth scores were. Adding this type of information would bring typical practice in
line with the Standards and should be done. Future research on how stakeholders interpret the
errors will be necessary to determine best practices around this reporting, although research
conducted on reporting errors in test scores in general is germane.
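
One simple form such error information could take is a band around the reported growth score.
The sketch below is an illustration only: the chapter does not prescribe a format, the symmetric
band of z standard errors is a simplifying assumption, the 1 to 99 limits are assumed bounds for
a percentile-type score, and the SGP values and standard errors are hypothetical.

```python
def growth_band(sgp, se, z=1.0, lo=1, hi=99):
    """Symmetric band of z standard errors around a growth percentile, clamped to [lo, hi]."""
    if se < 0:
        raise ValueError("standard error must be non-negative")
    return max(lo, sgp - z * se), min(hi, sgp + z * se)


def bands_overlap(a, b):
    """True when two (low, high) bands share at least one value."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical values: two reported SGPs, each with a standard error of 10.
year_1 = growth_band(52, 10)
year_2 = growth_band(65, 10)
print(year_1, year_2, bands_overlap(year_1, year_2))
```

Overlapping bands are a cue that an apparent change in growth from one year to the next may not
reflect a real difference, which is the kind of comparison respondents found difficult.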


4. Report Development Processes 


Underlying all of the observations about growth reporting here is the idea that reports do
matter, and that report development should follow a logical sequence of events that includes
the solicitation of feedback from intended audiences. This argument has been advanced by
Zenisky and Hambleton (2015; see also Zenisky & Hambleton, 2012) in their proposed model for
report development. As with all score reporting efforts, displays and strategies for growth
reporting should begin with a data-gathering phase, including a statement of the purposes of
any report along with the intended audiences, followed by report preparation, tryout with
intended users, and a final phase of monitoring and improvement.


Summary 


This brief chapter does not end with conclusions, because the conversation about growth model
reporting is only beginning. As the use of growth model results for multiple audiences and
purposes increases (for individuals, for groups of students, and for reporting of teacher
quality), the importance of how these data are communicated will only increase. Rather, we
raise the following broad questions as a call for more research on how growth model results are
used and understood.


• How can growth score results, with their added complexity over single-occasion test scores,
be displayed so as to be readily understood and used by relevant audiences?

• How can errors in growth scores be communicated?

• How should growth scores be used by stakeholders? Are there specific actions that
stakeholders should take based on these growth scores? Answers to these questions might help
to clarify how the scores should be reported.

• Can tools be developed for interested stakeholders to plug in the information from their
score reports to get actionable information?

• How much do stakeholders use interpretative materials? While this applies to score
reporting in general, given the additional complexity of growth measures, it might be
even more relevant here. Are there ways to make interpretative materials more accessible?


There are many more questions that can be imagined, and we further note that the displays
used in the study here are not representative of all of the approaches being used in the states.
Accordingly, this topic remains an important direction for continued work on the reporting of
growth.


Communicating Measurement Error
Information to Teachers and Parents


Diego Zapata-Rivera, Priya Kannan, and Rebecca Zwick


Clearly communicating assessment results to the intended users so they can make appropriate
use of this information is a central issue for assessment validity (Kane, 2013; Tannenbaum,
this volume). The Standards for Educational and Psychological Testing (American Educational
Research Association, American Psychological Association, & National Council on Measurement
in Education, 2014) contains several guidelines on score reporting issues, including the
need to provide interpretations of assessment information that are appropriate for the intended
audience, evidence to support interpretations for intended purposes, information about
recommended uses, and warnings about possible misuses. These standards address the
responsibilities of test developers in appropriately communicating assessment results, and the
rights of test users and test takers to understand and make use of this information
appropriately.

Research on score reporting has produced some guidelines and iterative frameworks for
designing score reports that meet the Standards (Goodman & Hambleton, 2004; Hambleton &
Zenisky, 2013; Hattie, 2009; Underwood, Zapata-Rivera, & VanWinkle, 2007; Wainer, 2014;
Wainer, Hambleton, & Meara, 1999; Zapata-Rivera, VanWinkle, & Zwick, 2012; Zenisky &
Hambleton, 2016). These guiding principles usually involve steps for evaluating score reports
with the intended audience. Additional information about some of these frameworks can be
found in Brown, O'Leary, and Hattie (this volume), O'Donnell and Sireci (this volume), and
Tannenbaum (this volume).

Because the characteristics of score reporting audiences vary, different approaches to
facilitating comprehension of assessment information for the intended audience have been
explored. Zapata-Rivera and Katz (2014) suggest focusing on the needs, knowledge, and attitudes
of the audience as an important step in the process of designing interactive score report
components. These components may include the use of graphical representations, written
explanations, examples, on-demand help, video tutorials designed to address misconceptions, and
navigation approaches.

Given the importance of information about the precision of the test scores (i.e., measurement
error) to inform decision making, the Standards include language on the need for providing
interpretations of what the scores represent, their precision, and how they are intended to be
used. Prior work has shown that communicating measurement error information to test users
and test takers can be challenging (Hambleton & Slater, 1997; Kannan, Zapata-Rivera, &
Leibowitz, in press; Lukin, Bandalos, Eckhout, & Mickelson, 2004; Zapata-Rivera, VanWinkle,
& Zwick, 2010; Zapata-Rivera & Zwick, 2011; Zwick et al., 2008). Zwick, Zapata-Rivera, and
Hegarty (2014) described user difficulties in comprehending verbal and graphical
representations of measurement error. There is no one-size-fits-all visual representation that
is uniformly understood by all users, and no universal line of text that clearly conveys what
standard errors are and why they are important (Zenisky & Hambleton, 2016).

In this chapter, we examine research on designing and evaluating score reports for teachers
and parents. In particular, we focus on issues regarding the communication of measurement
error information to these audiences. Finally, challenges and opportunities for continuing
research in this area are identified and discussed.


Teachers and Parents as Two Different Audiences


Even though both teachers and parents want to receive information about students' test
performance, their needs and uses of assessment information may differ. For example, teachers
may be interested in both classroom and individual-level information, while parents are likely
to be interested only in their individual child's performance. Teachers may use the assessment
results to inform grouping and instructional planning, while parents may use this information
to support their conversations with their child's teacher and to obtain appropriate support for
their child. In the past, score reports for parents were sometimes regarded as simpler versions
of those produced for teachers. While reports resulting from this approach may have provided
parents with test-relevant information, these reports were not necessarily designed with
parents' needs and levels of understanding in mind (Barber, Paris, Evans, & Gadsden, 1992).


Needs, Knowledge, and Attitudes of Teachers


Work on applying audience analyses to the design of score reports for teachers has considered
their needs for assessment information, their prior knowledge about assessment, and their
attitudes toward score report information (Zapata-Rivera & Katz, 2014). In particular, teachers
need information that can be used to guide instruction (Underwood, Zapata-Rivera, &
VanWinkle, 2007). This requirement has been referred to as “who needs to be taught what next”
(Brown, O'Leary, & Hattie, this volume). The questions teachers are interested in include:
How did the class perform on the test? What are my students' strengths and weaknesses? How
does a particular student's score compare to other students' scores? How difficult were the
tasks for students? And what should I do next to help an individual student or the class as a
whole?

In general, cognitive aspects of the user such as perception, attention, working memory, and
prior knowledge play an important role in users' comprehension of graphical representations
(Hegarty, this volume). In the field of score reporting, we have found that after reading
interpretive materials, teachers usually have the knowledge required to understand most of the
information typically included in score reports (e.g., scores, score means, percentiles)
(Zapata-Rivera, VanWinkle, & Zwick, 2012). However, the use of technical language may
interfere with proper understanding of score report information (Hambleton & Slater, 1997;
Underwood, Zapata-Rivera, & VanWinkle, 2007; Zapata-Rivera, VanWinkle, & Zwick, 2012). In
addition, teachers may have limited knowledge of the concept of measurement error and how to
use it to inform their decisions (Zwick, Zapata-Rivera, & Hegarty, 2014; Zapata-Rivera, Zwick,
& Vezzu, 2016).

In terms of attitudes, teachers may value clear and direct answers to their assessment
questions, since many of them have limited time to explore assessment results (Zapata-Rivera,
Hansen, Shute, Underwood, & Bauer, 2007), and they may place too much trust in the precision
of scores (Zapata-Rivera, VanWinkle, & Zwick, 2012; Zwick et al., 2008). Information about
teacher attitudes toward assessment in general can be found in Goertz, Olah, and Riggan (2009),
Mandinach and Gummer (2016), and Marshall and Drummond (2006).


Needs, Knowledge, and Attitudes of Parents


Parents and guardians are keen to understand how their child has performed on a test, what the
scores mean, and what they can do to help support their child to improve performance in the
future (Kannan, Zapata-Rivera, & Leibowitz, in press). In order to support parents'
interpretations and uses of the information conveyed in score reports, it is important that
their unique needs, pre-existing knowledge, and attitudes are taken into consideration.

Research on identifying parents' needs has found that, across the board, parents are most
interested in understanding what each score on the report means, how their child performed
against set standards, their child's performance level (e.g., basic, proficient, and advanced),
and the implications of that placement (A-Plus Communications, 1999; Kannan, Zapata-Rivera, &
Leibowitz, in press; Munk & Bursuck, 2001; NEGP, 1998). Beyond that, research evaluating
score reports with diverse subgroups of parents (Kannan, Zapata-Rivera, & Leibowitz, in
press) has found that what parents list as their second-most important need varies across
demographic subgroups. While parents with a college degree wanted to see how their children
were performing in relation to other students (i.e., normative comparisons) in their own and
other schools within their district or state (A-Plus Communications, 1999; Kannan,
Zapata-Rivera, & Leibowitz, in press), parents with no college degree listed areas that needed
improvement and ways to help their child as the second-most important information (Kannan,
Zapata-Rivera, & Leibowitz, in press).

With regard to comprehension of information presented in score reports, parents typically
do not have the assessment-related and technical background required to appropriately
interpret and use the test results presented for their child (Barber et al., 1992). Parents
struggle to fully understand some of the information that is typically included in individual
student reports (ISRs). These include information about their child's performance in subareas,
their child's growth across the years, and the measurement error involved in their child's
scaled score (Kannan, Zapata-Rivera, & Leibowitz, in press; Rick et al., 2017). These pieces of
information have ranked much lower on parents' listed needs (A-Plus Communications, 1999;
Kannan, Zapata-Rivera, & Leibowitz, in press; Munk & Bursuck, 2001; NEGP, 1998) when compared
to information about their child's overall score and performance level placement.

In terms of attitudes, Barber et al. (1992) found that only 53% of the 105 parents surveyed
thought that the assessment contributed to their child's education. A recent poll by Phi Delta
Kappa (PDK) and Gallup (2015) showed that even though 71% of public-school parents across
the country felt that using tests to measure what students have learned was important for
improving public schools in their community, 67% of respondents felt that there was too much
emphasis on testing in their children's schools, a perception that has remained somewhat
persistent through the last couple of decades (see A-Plus Communications, 1999; Bennett, 2016;
PDK & Gallup, 2015). Additional information about parent attitudes toward assessment can be
found in Harris and Brown (2016).


Score Reports for Teachers and Parents 


In this section, we describe the types of score reports that are usually designed for two key
audiences: teachers and parents.


Common Features of Score Reports for Teachers


Score reports for teachers usually include classroom-, individual-, and task/item-level score
reports. These reports can be available as online interactive reports or printed documents. As an
example of the type of information included in teacher score reports, we describe the score
report prototypes that were designed and evaluated as part of the Cognitively Based Assessment
of, for, and as Learning (CBAL™) research initiative (Bennett & Gitomer, 2009). Even though this
section focuses on teacher score reports used as part of the CBAL research initiative, most
operational score reports for teachers include a subset of the score report elements presented
here. More information about these reports can be found in Zapata-Rivera, VanWinkle, & Zwick (2012).


• Individual-level reports provide information to answer the following questions: How did
student X do on the test? How did students in the same grade do on the test? And what should
be done next? Information in the CBAL individual report includes a general description of
the test and the sections of the report; a personal identification section with information
such as student name, grade, teacher name, subject, and test date; a section with
appropriate and inappropriate uses of the assessment information presented in this report; a
performance summary section with information such as scaled scores, confidence bands,
performance levels, a distribution of scores for students in the same grade, and additional
materials (e.g., explanations of statistical terms used in the report); a current test
performance section that includes information such as raw scores and links to relevant
information (e.g., skill definitions, sample tasks, and explanations of statistical terms used
in the report); and a “What to do Next?” section that provides a summary of how the student did,
information about the next performance level, and recommendations for teacher follow-up.

• Classroom-level reports respond to the question, how did students in my classroom
perform on the test? The CBAL classroom report includes an introduction, a section with
appropriate and inappropriate uses of this report, and classroom performance information
presented as a sortable table including score and performance level information for
each student in the class and a graph showing the score distribution of the class among
performance levels. This report also includes links to explanations and definitions of
some terms.

• Task/item-level reports provide information to answer the question, how did my students
do on this task? The CBAL item-level reports include an introduction section, a section
with appropriate and inappropriate uses, and a table with item/task difficulty information.
In addition, each task/item is linked to information about related content and process
skills.


Several aspects of the report (e.g., navigation, additional explanations, and lists of appropriate
and inappropriate uses) are organized according to the needs and attitudes of this audience in
order to facilitate finding the information needed to inform decisions and minimize
opportunities for misuse. Even though these teacher reports were designed to provide score report
information after each of the interim assessments that comprised a larger assessment system,
CBAL has developed other reporting systems for formative purposes that can serve as learning
tools, allowing teachers to assign tasks to students and provide immediate feedback
(Zapata-Rivera, 2011).


Common Features of Score Reports for Parents


Score reports for parents and guardians are always intended to provide results for their
individual child; these reports can be based on either formative or summative assessments
(Kannan, Bryant, Zapata-Rivera, & Peters, 2017). Results presented to parents may be evaluative
and present summary information about their child's performance on a standardized assessment
at the end of a learning period. Alternatively, the results provided may be intended to support
decisions about placement in advanced classes or remediation for their children (Kannan,
Bryant, Zapata-Rivera, & Peters, 2017). Finally, though score reports designed for parents have
traditionally been static, printed score reports, they could also be interactive score reports
with layers of drillable information to accommodate the varied needs of this extremely diverse
stakeholder group, whose members vary in education level, language proficiency, socio-economic
background, and an array of other variables (Kannan, Zapata-Rivera, & Leibowitz, in press).

Overall, parent score reports based on summative or end-of-year assessments are intended
to answer the following questions: How did my child do on the test? How did other students in the
same grade (in his school, and in the state and district) do on the test? What are my child's
general strengths and weaknesses? Has my child shown any improvement since last year? How can I
help my child or where can I get more help? Parent score reports based on formative assessments
are typically designed to answer questions such as: What are my child's specific strengths and
weaknesses? Does my child have any specific weaknesses that we should work on with the teacher?
Where can I get more help to support my child's growth in this area?

Research (e.g., Goodman & Hambleton, 2004; Hambleton & Zenisky, 2013; Kannan,
Zapata-Rivera, & Leibowitz, in press; NEGP, 1998; Zapata-Rivera & VanWinkle, 2010; Zenisky &
Hambleton, 2012) has suggested that score reports designed for parents should include a range
of test- and performance-related information, such as (a) a description of the purpose of the
test; (b) a snapshot or at-a-glance summary in the beginning; (c) a personal identification
section with information such as student name, grade, teacher name, subject, and test date; (d)
cues, hints, definitions, and extended descriptions when technical language is used; (e) scores
represented as graphics with colored bars, so parents find it easy to make comparisons; (f)
norm-referenced comparative information for other students in the same grade (within the
school, district, and state); (g) a description of the nature and precision of scale scores in an
unambiguous manner (i.e., measurement error); (h) the student's performance across the
subareas tested, with a detailed description of each subarea; (i) a listing of the types of
knowledge and skills (as well as examples of items) that the student has mastered and currently
struggles with; and (j) information about next steps and where to get additional help for their
child. Finally, it has also been recommended (Goodman & Hambleton, 2004; Kannan,
Zapata-Rivera, & Leibowitz, in press) that significant efforts should be made to limit overall
use of technical language by simplifying and streamlining the text. If possible, score reports
for parents should be made available in multiple languages.


Communicating Information About Measurement Error


Representing and communicating uncertainty is a topic of interest in several disciplines
(Correll & Gleicher, 2014; Demmans Epp & Bull, 2015; Hopster-den Otter, Muilenburg, Wools,
Veldkamp, & Eggen, 2018; Ibrekk & Morgan, 1987; Spiegelhalter, Pearson, & Short, 2011).
Clearly communicating uncertainty is important in order to support evidence-based decision
making. An understanding of the level of uncertainty of a particular event can play an
important role when making decisions based on scientific data (e.g., deciding whether to evacuate
before a hurricane or comparing medical treatments; Fischhoff & Davis, 2014).

In educational assessment, appropriate communication of uncertainty is particularly
important when educational decisions are to be made on the basis of test scores. Measurement
error information associated with the test scores can provide the knowledge test users need to
make a particular decision. For example, a test user may wonder whether two test scores are
meaningfully different. Although in some cases measurement error information is not
presented due to possible misinterpretations, educating the general public on how to understand
and use this information may increase transparency and confidence (Newton, 2005).
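
As an illustration of the question of whether two test scores are meaningfully different, the
following minimal sketch compares the gap between two scores with the standard error of their
difference. It is not drawn from any of the studies cited here. It assumes that the errors in
the two scores are independent, and the function names, the 1.96 threshold, and the score and
SEM values are hypothetical choices made only for this example.

```python
import math


def standard_error_of_difference(sem_a: float, sem_b: float) -> float:
    """Standard error of the difference between two scores with independent errors."""
    return math.sqrt(sem_a ** 2 + sem_b ** 2)


def scores_differ(score_a: float, score_b: float,
                  sem_a: float, sem_b: float,
                  z_crit: float = 1.96) -> bool:
    """Return True when the score gap exceeds z_crit standard errors of the difference.

    z_crit = 1.96 is an assumed, conventional two-sided threshold, not a value
    prescribed by the chapter.
    """
    sed = standard_error_of_difference(sem_a, sem_b)
    return abs(score_a - score_b) > z_crit * sed


# Hypothetical scaled scores, each with a standard error of measurement of 3 points.
# The 6-point gap is smaller than 1.96 * sqrt(18), which is about 8.3, so this prints False.
print(scores_differ(150, 156, 3.0, 3.0))
```

Under these assumptions, a 6-point gap would not be treated as a meaningful difference, whereas
a gap of 10 points would.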


Communicating Information About Measurement Error to Teachers 


In this section, we describe two studies in which researchers investigated ways of
communicating measurement error information to teachers. Both studies explored comprehension
and preference aspects of the intended audience and used a variety of research
methods (e.g., usability studies, pilot studies, and large-scale studies). Both comprehension and
preference data provide important insights on the types of misconceptions teachers have and
the types of supports that can be put in place to help them understand and use measurement
error information appropriately.

Zwick, Zapata-Rivera, and Hegarty (2014) explored the use of verbal and graphical
representations of measurement error intended to help teachers understand and make appropriate
decisions based on test-score information. The research participants, 148 teachers and 98
introductory psychology students, were asked to view score reports that included varying
representations and descriptions of measurement error. Verbal descriptions included analogies
between measurement error in test scores and error in measuring weight and blood pressure.
Graphical representations of measurement error included a standard error bar and a tapered
confidence band (see Figure 5.1). Participants were randomly assigned to one of four conditions
(two verbal descriptions crossed with two graphical representations). As suggested in the
literature on score reporting, participants were asked both preference and comprehension questions
(Wainer, Hambleton, & Meara, 1999; Zenisky & Hambleton, 2012). Results showed that
participants who reported greater comfort with statistics tended to have higher comprehension
scores and to prefer the tapered confidence band. Several misconceptions about measurement error
were identified, such as the belief that test scores are perfectly precise or that the level of
certainty was constant across the confidence band. Some participants assumed that confidence
bands for test scores must be based on multiple observed scores from a single individual or from
a group of test takers. Participants mentioned the need for explanations of the meaning of
confidence bands and for information that could be used to support decision making (information
that was intentionally omitted in the study).

Figure 5.1 A standard error bar and a tapered confidence band.

A follow-up study exploring the effectiveness of a short, web-based tutorial in helping
teachers to better understand the measurement error information in test-score reports was carried
out (Zapata-Rivera, Zwick, & Vezzu, 2016). Participants were 145 K-12 teachers across a variety
of subject areas including mathematics, English language arts, science, and foreign language.
Two short video tutorials were created. The basic tutorial included simple definitions, examples
of the causes of measurement error, illustrations of confidence bands, and explanations of how
to interpret them. The enhanced version included additional screens showing how a confidence
band is obtained. Results showed that participants who were assigned to the tutorial conditions
(basic and enhanced) significantly outperformed those assigned to the control condition (no
tutorial) on the comprehension questionnaire. The proportion of variance in comprehension
scores that was attributable to experimental condition (eta-squared) was .23, but the difference
between the two tutorial conditions was not statistically significant. Results of a usability
questionnaire administered to those participants in the tutorial conditions showed that most
would like to use this type of tutorial in the future and found the tutorial useful (96%), easy
to understand (93%), and engaging (90%). The majority believed they learned a lot from it (85%)
and reported that they understood what a confidence band represents (97%). These results showed
the potential of instructional materials like these to provide teachers with clear information
that helps them understand score report information and use it in appropriate ways.
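
For readers less familiar with eta-squared, the sketch below shows how the proportion of
variance attributable to condition can be computed as the between-group sum of squares divided
by the total sum of squares. It is an illustration only: the function name and the three small
groups of scores are hypothetical and are not data from the study described above.

```python
def eta_squared(groups):
    """Between-group sum of squares divided by total sum of squares."""
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    ss_total = sum((score - grand_mean) ** 2 for score in all_scores)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total


# Hypothetical comprehension scores for three conditions (not the study's data).
control = [4, 5, 6]
basic_tutorial = [6, 7, 8]
enhanced_tutorial = [7, 8, 9]

print(round(eta_squared([control, basic_tutorial, enhanced_tutorial]), 2))
```

A value of .23, as reported above, would mean that just under a quarter of the variation in
comprehension scores was associated with the condition to which a participant was assigned.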

Lessons learned in terms of the research methodology employed with teachers were applied
to a different audience, parents. The next section discusses work on communicating
measurement error information to parents.


Communicating Information About Measurement Error to Parents 


Whether it is useful to include information about measurement error (or score precision) in
ISRs primarily intended for parents has been a controversial issue. In particular, the Standards
(AERA, APA, & NCME, 2014) and several researchers (e.g., Faulkner-Bond, Shin, Wang, &
Zenisky, 2013; Zapata-Rivera, Zwick, & Vezzu, 2016) have specifically recommended that a
description of the nature and precision of scale scores be presented in an unambiguous manner
in ISRs. However, Wainer, Hambleton, and Meara (1999) have recommended omitting
information about measurement error from ISRs unless this information can be presented in
a way that leads to accurate interpretations and appropriate uses.

Until recently, there was very little evidence in the research literature about the steps taken
to explain or quantify error and uncertainty when reporting test results (to any stakeholder
group). In their survey of international score reports, Bradshaw and Wheater (2009) found that
while descriptive information (e.g., overall score, grades) was easy to find on most reports, it
was almost impossible to find any explanations of reliability or measurement error in most of
the reports they reviewed. However, more recently there has been increasing research on how
to represent and communicate measurement error information to parents (Kannan,
Zapata-Rivera, & Leibowitz, in press; Kannan, Bryant, Zapata-Rivera, & Peters, 2017; Zapata-Rivera,
Vezzu, & Biggers, 2013; Zapata-Rivera et al., 2014).

In practice, there has been vast variation across states in the amount and nature of
information about score precision that is provided in ISRs for standardized assessments. Several
states do not provide information about error or precision of scores in their score reports. For
example, after reviewing score reports for 41 states, Faulkner-Bond et al. (2013) found that only
two states provided information about measurement error for their English Language Proficiency
(ELP) assessments. Other states that do provide information about measurement error typically do
not provide a clear explanatory text. More recently, ISRs designed for parents have started to
include information about measurement error. However, studies evaluating the interpretation and
use of this information by parents, as a diverse and heterogeneous stakeholder group, are minimal.

Consistent with the suggestion to understand stakeholder needs (e.g., Hambleton &
Zenisky, 2013), we have conducted studies at ETS (Kannan, Zapata-Rivera, & Leibowitz, in press;
Kannan, Bryant, Zapata-Rivera, & Peters, 2017) to understand how to appropriately
communicate measurement error information to parents. Similar to the studies with teachers (Zwick,
Zapata-Rivera, & Hegarty, 2014; Zapata-Rivera, Zwick, & Vezzu, 2016), the studies with parents
involved the use of both comprehension and preference questions and a variety of research
methods (e.g., cognitive laboratories and usability and experimental studies comparing various
graphical and verbal representations).

In an early study with parents, Kannan, Zapata-Rivera, and Leibowitz (in press) used
cognitive laboratories to investigate the extent to which potential users could understand and
interpret various pieces of information (including measurement error) presented in a hypothetical
ISR. Participants were 35 parents from diverse subgroups (disaggregated by education level
and English language proficiency). Results from that study suggested that parents across the four
subgroups defined by education level (i.e., those with and without a college degree) and English
language proficiency struggled to understand the information presented about measurement
error in particular. Even though parents were allowed to refer to the hypothetical score report
in answering the comprehension questions, about 50% or more of the parents in each
subgroup, even those with college degrees, were not able to accurately read back or reiterate the
information presented about score precision. Parents in this study pointed out that even though
the information about “precision/error” was understandable and perhaps even useful, it could
be confusing and overwhelming to parents in general, and that they were not sure if parents
should/would care about this information.

Therefore, in a follow-up study, Kannan, Bryant, Zapata-Rivera, and Peters (2017) used a
between-subjects experimental design to evaluate parents' comprehension of measurement error
information. Specifically, they sought to determine whether parents understood information
about measurement error and whether they would find this information useful in making
appropriate inferences about their child's performance. A total of 196 parents of middle school
children were randomly assigned to three conditions in an online experiment: (a) a condition
where no error information was presented; (b) a condition where measurement error was presented
graphically with a bar around the score (one standard error of measurement above and below the
observed score) and a standard footnote typically used in state standardized assessment reports;
and (c) a condition where measurement error was presented graphically as in the previous
condition but with a more detailed (enhanced) footnote describing the various factors that could
affect a child's score on any given test administration. The researchers did not, however,
describe how these error bars are computed (or the percent confidence) to the participants in any
of the study conditions. Once participants answered the comprehension questions, they had the
opportunity to examine all three different representations of measurement error and indicate
which representation they would prefer included in their child's score report. Results from this
study suggest that parents are highly receptive to information about measurement error, and that
between 58% and 79% of parents across all three between-subject study conditions (irrespective of
the type of information they received during the rest of the study) preferred the representation
with the most information (i.e., the enhanced error representation). Moreover, when provided with
a detailed written explanation, parents were more likely to understand this information (with
higher overall comprehension scores) than when such explanation was not provided.
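
The error bar used in conditions (b) and (c) spans one standard error of measurement (SEM) above
and below the observed score. The following sketch shows one way such a band could be computed.
It is an illustration under stated assumptions rather than the procedure used in the study: the
SEM is estimated here with the classical test theory formula (score standard deviation times the
square root of one minus reliability), and the function names and all numerical values are
hypothetical.

```python
import math


def sem_from_reliability(score_sd: float, reliability: float) -> float:
    """Classical test theory estimate: SEM = SD * sqrt(1 - reliability)."""
    return score_sd * math.sqrt(1.0 - reliability)


def error_band(observed: float, sem: float, n_sems: float = 1.0):
    """Return (lower, upper) bounds n_sems standard errors around the observed score."""
    return observed - n_sems * sem, observed + n_sems * sem


# Hypothetical values, not taken from any study cited in this chapter.
sem = sem_from_reliability(score_sd=10.0, reliability=0.91)
low, high = error_band(observed=150.0, sem=sem)
print(f"SEM = {sem:.1f}; band = {low:.1f} to {high:.1f}")
# prints: SEM = 3.0; band = 147.0 to 153.0
```

If measurement errors are assumed to be normally distributed, a band of one SEM on either side of
the observed score corresponds to roughly 68% confidence; a wider band can be obtained by
increasing n_sems.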

The results from the second parent study (Kannan, Bryant, Zapata-Rivera, & Peters, 2017)
can be interpreted as evidence that parents are not only trying to understand the
information presented about measurement error (indicated by the higher comprehension scores for
the “standard” and the “enhanced” conditions), but also want to try and use this information to
better understand their child's performance on standardized assessments. Overall, from these
results, we glean that it is important to provide clear and detailed information to parents so that
they are able to easily understand this information.


Discussion


Here we offer a discussion of several important themes that have emerged from the score
reporting research.


Sharing Research Methods and Materials 


Even though each audience has its own characteristics, it is possible to use similar
methodological approaches (e.g., usability studies, cognitive laboratories, focus groups,
interviews, and controlled studies) and data collection materials (e.g., preference and
comprehension questionnaires) in studying them. In some cases, insights gained from doing research
with one audience can inform future work with other audiences. For example, teachers may suggest
ideas about materials they have used or would like to use to share assessment information with
parents or students.


Preference and Comprehension 


When creating score reports for different audiences, it is important to consider both preference
and comprehension issues since it is possible that the preferred display may not be the
best-understood one. In terms of interactive, computer-based reports, usability studies and
cognitive laboratories provide interesting information about issues that may hinder the
interaction. Also, these research approaches provide useful data on the cognitive processes that
users exhibit when trying to understand the assessment results provided in the report. These
studies also provide an opportunity to pilot-test data collection materials that will be used in
large-scale studies.


More Research Needed 


More research is needed in the area of score reporting. Score reports may include “legacy” 
score report elements that do not clearly communicate assessment information to particular 
audiences. Research on the effectiveness of particular score report elements to communicate 
assessment information should be conducted. 

Other potential areas of research include exploring the trade-off between achieving
simplicity of reports (e.g., by hiding information that might result in misinterpretations) and
using instructional and training materials, such as video tutorials, to facilitate understanding
of important assessment information. More research on exploring the effectiveness of
instructional materials for teaching different assessment concepts to different audiences is also
needed.


Summary 


Research on score reports involves exploring how to present assessment information to
different audiences. When designing and evaluating score reports and additional materials, it is
important to take into account the needs, knowledge, and attitudes of the audience and to pay
attention to both preference and comprehension issues to capture a complete picture of the
benefits and drawbacks of the score report elements being studied.

The work on communicating measurement error information to teachers and parents
provides a good use case where the characteristics of the audience have been taken into account
to design different types of graphical representations and supporting materials. Although the
final reports for parents and teachers may look completely different, this work shows that it is
possible to apply similar research methods and materials with different audiences.


Score Reporting Issues for Licensure,
Certification, and Admissions Programs


Francis O'Donnell and Stephen G. Sireci


Testing the knowledge, skills, and abilities of people has a long history, almost as long as
recorded history itself. For example, the story of Adam and Eve in the book of Genesis in the
Bible could be considered the earliest recorded “test.” Later, in the same book, there is a second
reference to a test when God “tested” Abraham by asking him to sacrifice his son Isaac (around
2100 BC). With respect to large-scale testing, the Book of Judges (12:4-6, around 1400 BC)
describes the one-item “test” developed by Gileadites to identify the enemy Ephraimites who were
hiding among them (the test was to pronounce the word “shibboleth”). Although these events are of
historical interest in considering when testing first occurred, there are no formal records of the
results of these performance assessments.

The focus of this chapter is on the results of testing (score reports) for licensure,
certification, and admissions testing programs. We will focus on current score reporting
practices, and so we will not attempt to trace the history of score reporting practices to their
origin. However, our review of these practices suggests that the first recorded results of
credentialing testing may have literally been carved in stone. For example, “score reports” from
the Le Thanh Tong dynasty in Vietnam (1484 AD) can still be seen today in the Temple of Literature
in Hanoi. All candidates who passed the rigorous steps to be selected to work in the central
imperial government under the Emperor had their names and hometown carved into huge steles in the
shape of a turtle, which emphasized their longevity and wisdom. According to Nguyen (2009), this
feudal examination model originated from the testing approach used in China for civil and military
testing, and there are similar steles in the Temple of Literature in Beijing.

Today, score reports are not carved in stone. In fact, in many instances they are not even
printed on paper. Instead, many credentialing and admissions testing programs provide score
reports in digital formats via a URL that examinees, parents, and other
stakeholders access and interact with to acquire various levels of detail regarding their
performance on a test. In this chapter, we describe current score reporting practices in
credentialing and admissions testing, discuss some of the practical issues and validity issues
involved in reporting test results in these areas, and provide suggestions for future research
and practice.

Before beginning our review, it is important to define the terms we use for different testing
contexts. Admissions tests refer to tests that have the primary purpose of providing
information to those who make admissions decisions at various schools, such as selective high
schools, colleges, universities, and postsecondary schools (e.g., medical schools, law schools,
business schools, other graduate programs). Licensure tests refer to exams developed as part of a
professional licensure requirement that is needed for practice within a profession. Examples
include the Uniform Certified Public Accountants Exam for accountants, the National Bar Exam for
lawyers, the United States Medical Licensure Exam for medical doctors, and the National
Council Licensure Examination for nurses. Certification tests refer to tests used to award
certificates to candidates to certify competence or excellence, independent of a licensure
requirement. Examples of certification exams include those used in the technology industry to
certify competence in working with hardware or software (e.g., Microsoft, Cisco, Hewlett Packard
exams), career and technical education (e.g., automotive, culinary exams), and accomplished
teaching beyond the licensure stage (e.g., National Board of Professional Teaching Standards).
Because the issues and practices in licensure and certification testing are so similar, the more
general term credentialing testing can be used to describe both contexts.


Current Practices in Score Reporting in Credentialing and Admissions Testing 


There is great variety in the content and design of score reports for credentialing and
admissions tests. Within the same field, some reports use a simple “letter” format, while others
consist entirely of tables with numbers. To describe current practices in score reporting for
admissions and credentialing tests, we conducted online searches of score reports from admissions
testing and licensure testing programs, and we contacted 60 certification programs from a list
of organizations accredited by the National Commission for Certifying Agencies (Institute for
Credentialing Excellence, 2017).

Through these efforts we were able to locate score reports from 38 testing programs: 22
certification programs, eight admissions programs, and eight licensure programs. We did not use
random sampling to select these score reports and so they cannot be considered representative
of these areas. Nevertheless, they illustrate a wide variety of examples of current score reporting
practices across multiple domains. The 22 certification score reports came from programs in
nursing specialty areas (five reports), fitness and exercise (three reports), pharmacy specialties
(two reports), occupational therapy and rehabilitation (two reports), other health-related
specialties (seven reports), culinary arts (one report), real estate management (one report), and
safety (one report). The eight reports for admissions tests came from programs designed to
inform admissions into professional school (three reports), middle school (two reports),
undergraduate colleges and universities (two reports), and graduate school (one report). Lastly,
the eight licensure reports represented programs supporting licensure in financial services (three
reports), teaching (three reports), and health sciences (two reports).

In the next section we summarize the features of those reports and the types of information
presented for each context. The reports typically provide information pertaining to examinees'
performance on the exam overall, their performance in specific subdomains, and general
interpretive guidance. A summary of the elements included in these reports, stratified by testing
context, is presented in Table 6.1. Although there are some similarities in the types of
information presented, there are also notable differences across these testing contexts. For the
certification context, where we had the most responses, different types of information were reported
depending on whether the candidates passed or failed the exam, so separate results for “pass”
and “fail” reports are presented. We begin with a summary of the information provided in the
reports from admissions testing programs.

Table 6.1 Types of information included in 38 reviewed score reports.

                                                         Testing Context
Information Provided               Admissions   Certification   Certification   Licensure
                                   (n = 8)      Pass (n = 22)   Fail (n = 22)   (n = 8)

Overall Results
  Numerical score                  8 (100%)     13 (59%)        19 (86%)        6 (75%)
  Performance levels¹              2 (25%)      22 (100%)       22 (100%)       8 (100%)
  Performance level descriptions   2 (25%)      0 (—)           0 (—)           0 (—)
  Information about precision      5 (63%)      0 (—)           0 (—)           1 (13%)
  Visual display                   7 (88%)      4 (18%)         7 (32%)         2 (25%)

Subdomain Results
  Numerical score(s)               4 (50%)      7 (32%)         15 (68%)        1 (13%)
  Performance levels               1 (13%)      1 (5%)          8 (36%)         6 (75%)
  Performance level descriptions   1 (13%)      0 (—)           3 (14%)         5 (63%)
  Information about precision      1 (13%)      2 (9%)          6 (27%)         3 (38%)
  Visual display                   3 (38%)      7 (32%)         19 (86%)        6 (75%)

Interpretive Information
  Statement of test purpose        1 (13%)      0 (—)           0 (—)           2 (25%)
  Guidance on next steps           5 (63%)      16 (73%)        19 (86%)        3 (38%)
  Details about where to find      6 (75%)      7 (32%)         15 (68%)        6 (75%)
    additional resources

¹ For certification and licensure reports, “pass” and “fail” were considered performance levels.


Score Reporting Features in Admissions Testing Programs 


All eight reports for the admissions tests presented results for multiple subject areas (e.g.,
quantitative reasoning, verbal reasoning) and included numerical total scores for each area (which
we refer to as “overall scores”). Six reports (75%) presented composite scores, or total scores
across subject areas, in addition to single-subject overall scores. Among those, three (50%)
presented both types of scores with the same level of detail. However, there were two reports in
which performance levels were only provided for overall scores, and one report in which
information about precision was reported for overall scores but not composite scores (note that the
ways in which composite scores were presented are only described in text; the “Overall Results”
section of Table 6.1 focuses on overall scores). Subscores were provided in four out of the eight
(50%) admissions testing reports.

There were several patterns in how overall scores were presented. In every case, both a scaled
score and its corresponding percentile were included. One report also provided stanines.
Notably, the score reports for the two undergraduate admissions tests categorized overall scores in
relation to college readiness benchmarks, a unique feature reflecting the programs' goal. Almost
all reports (seven, or 88%) included a visual display of overall performance: either a table with
numbers or a horizontal bar where a symbol or band denoted the location of the test taker's
score. Five reports (63%) included information about overall score precision: two used written
explanations only, two used both a written explanation and a visual representation of
measurement error (e.g., a score band where the width of the band reflected the amount of imprecision),
and one provided a “personal score range” that incorporated the standard error of measurement
(SEM) for the overall score.
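
To make the idea of a score band concrete, the following minimal sketch computes a band around an observed scaled score. It assumes the classical test theory relation SEM = SD × sqrt(1 − reliability) and a symmetric band of one SEM on either side of the score. The function names, the 200 to 800 reporting scale, and all numeric values are hypothetical; they are not taken from any of the reports reviewed here.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


def score_band(score, sem, n_sem=1.0, scale_min=200, scale_max=800):
    """Return (lower, upper) limits of a band of n_sem SEMs around a score.

    The band is clipped to the reporting scale so that it never extends
    beyond the lowest or highest reportable score (hypothetical 200-800).
    """
    lower = max(scale_min, score - n_sem * sem)
    upper = min(scale_max, score + n_sem * sem)
    return lower, upper


# Hypothetical values for illustration only.
sem = standard_error_of_measurement(sd=100.0, reliability=0.91)
print(score_band(520, sem))   # band around a mid-range score
print(score_band(790, sem))   # band clipped at the top of the scale
```

A program could report the resulting limits in text or draw them as the width of a band on a horizontal bar; whether one SEM or some other multiple is used is a design decision that should be explained to the reader.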

Three of the four reports that presented subscores (75%) provided the number and percentage
of items answered correctly within select content areas. One report used scaled scores to
present subdomain feedback, and that was the only report that included performance levels
for subscores (in relation to college readiness) as well as information about subscore precision.
Regarding visual displays, tables were used in three out of the four reports (75%) with
subdomain results.

With respect to explanatory text on the reports, only one of the eight reports included a
statement about the purpose of the test from which results were derived. Several reports (five,
or 63%) provided guidance on desirable next steps, such as sending scores to academic
institutions, requesting additional reports, and deciding whether to retake the test. One report stood
out in that it provided information to support this decision by including five questions to
determine if retesting would be beneficial, and a pie chart showing the percentage of test takers who
saw an increase, a decrease, or no change in their composite score upon retesting.


Score Reporting Features in Certification Testing Programs 


Almost all of the certification agencies sampled provided separate reports for passing and failing
test takers (20 of 22 agencies, or 91%). As shown in Table 6.1, the two kinds of reports were
substantially different. For example, “fail” score reports were more likely to include a numerical
overall score (19, or 86%) than “pass” score reports (13, or 59%). Overall scores were almost always
scaled scores and none of the reports included percentiles, reflecting the fact that the primary
purpose of certification programs is to determine whether a candidate performs above or below
a standard (i.e., criterion-referenced performance), and not to support norm-referenced
comparisons among candidates with the same pass/fail designation. Performance levels were always
included, either explicitly (e.g., “Result: PASS”) or implicitly (e.g., “We regret to inform you . . .”).
None of the certification reports included information about the precision of overall scores.

All 22 score reports for failing candidates included subdomain feedback, compared to only
eight reports (36%) for passing candidates. Reports for candidates who did not pass provided
either qualitative subdomain feedback in relation to performance levels or subscores; only one
report combined both approaches. Across the eight (36%) reports that used performance levels,
descriptions for such levels were only included in three. Additionally, information about the
precision of subdomain results was addressed in six reports (27%) for failing candidates and
two reports for passing candidates (but only eight of the reports for passing candidates provided
subdomain results). Visual displays were used by 19 reports (86%) for failing candidates and
seven (32%) for passing candidates, with the most common displays being tables and horizontal
bar graphs.

A substantial number of score reports provided guidance on next steps, for both passing
(16, or 73%) and failing (19, or 86%) candidates. Depending on the testing outcome, reports
typically presented information either on maintaining the newly earned certification and
obtaining certification materials, or on applying to retake the test and using the score report to
guide remediation. Among reports for candidates who did not pass, one unique feature was the use of
sympathetic language: words acknowledging the disappointment associated with failing an
exam and, sometimes, encouraging candidates to consider retesting. A much higher number of
reports for failing candidates (15, or 68%) included information about where to find additional
resources than reports for candidates who passed (seven, or 32%), which is expected since those
who did not pass have a greater need for additional information about the testing program.


Score Reporting Features in Licensure Testing Programs 


Although separate reports were produced for passing and failing candidates for the majority of
score reports from the certification tests we sampled, the same was not true for the score reports
from licensure programs. Among the eight licensure reports we collected, six included the same
elements regardless of the testing outcome, and two were only for failing candidates (i.e., those
two programs did not produce reports for passing candidates). Thus, we did not divide the
licensure reports into “pass” and “fail” as we did for the reports from certification exams.

The majority of licensure score reports (six, or 75%) included numerical overall scores. The
exceptions were the two reports made specifically for failing candidates, which focused on
subdomain results. When present, overall scores were either scaled scores or percent-correct
scores. All reports indicated candidates' performance level in terms of a pass or fail outcome,
but explicit descriptions of what it means to perform in the “pass” or “fail” range were not
included in any report. Only one report provided information about overall score precision, and
only two (25%) used visual displays to present overall results (in both cases, tables were used).

Unlike reports from the two other contexts, six of the eight score reports for licensure
programs presented subdomain results in relation to performance levels rather than numerically;
only one program reported numerical subdomain results. Additionally, almost all reports that used
performance levels (e.g., Lower/Borderline/Higher Performance) included descriptions of those
levels (five out of six, or 83%). Only three included details about subscore precision. In terms of
visual displays for subdomain results, five licensure reports used tables and one used a graphic
with horizontal bands representing performance.

As was the case for admissions and certification, few of the licensure reports (only two)
included descriptions of the purpose of the test. Unlike the two previous contexts, however, only
three of the reports (38%) provided guidance on next steps. In all three cases, the suggested next
steps involved using subdomain feedback to devise a study strategy, with a warning that candidates
would be best served by reviewing all content areas to some extent prior to retesting.


Interpretive Materials for Admissions and Credentialing Score Reports 


Almost all examples of score reporting include interpretative material to help examinees and
other stakeholders understand the content of the reports. The most common type of
interpretive material associated with score reports is an interpretive guide, which traditionally is
a static document mailed to stakeholders along with a score report. Among the reports we
reviewed, most admissions reports (six, or 75%), licensure reports (six, also 75%), and
certification reports for failing candidates (15, or 68%) included text about where to find additional
information. Score report users were typically referred to a website about the testing program,
a web page about understanding score reports, or the candidate handbook (for certification
reports). Unfortunately, it was not possible to gather all interpretive materials for every score
report reviewed.

With the growing popularity of online report delivery systems and the use of websites to
disseminate test-related content, several new types of interpretive materials have been created.
Ferrara and Lai (2016) conducted a review of documentation practices that included
certification and licensure programs. They found that test takers received information supporting score
interpretation and use not only through interpretive guides, but also through candidate
bulletins and handbooks. They also found that candidate bulletins typically described the purpose
of the test as well as test day instructions and information about interpreting score reports. This
possibly explains why so few score reports for credentialing programs in our review included
statements of test purpose. Additionally, it suggests that such programs may use candidate
bulletins, rather than interpretive guides, as primary avenues to provide interpretive information
pertaining to score reports.

In addition to “hard copy” guides for interpreting score reports, some testing programs
are also using videos posted on their websites to help stakeholders understand the report. For
example, both the American Board of Internal Medicine (2015a) and the USMLE program
(2017) use videos slightly under 5 minutes in which different parts of a score report are shown
along with voice-over narration. A video explaining the SAT score report (College Board, 2016)
uses a similar approach, but it has more dynamic effects and lasts under two minutes, perhaps
reflecting SAT test takers' format preferences. Another approach is seen in a video published
by the ACT program (2016), which combines score report screenshots and voice-over narration
with scenes where a YouTube personality in the same age demographic as most ACT test takers
appears in front of the camera to share information.

Some testing programs also provide other interpretive material via their websites. For
example, the American Board of Internal Medicine (2015b) and the Graduate Management
Admissions Council (2015) provide on-demand interpretive information through features such as
hyperlinks embedded in static score reports and buttons (e.g., “more information”) in
interactive score report delivery platforms.


Research on Score Reporting 


Developing successful score reports involves both art and science. The “art” refers to the creative
design process that is important for effective communication. The “science” involves
considering the various studies that have been done to investigate what information people can perceive
and comprehend, as well as the different types of information desired by the consumers of test
results. In the previous section, we described the features of current score reports in admissions,
licensure, and certification testing. The content of these reports has been determined through
research focusing on the information score report users desire. Based on this research, several
models for guiding report development have been proposed. In this section, we review this
research, which provides recommendations and promising methods for developing, enhancing,
and evaluating score reports for credentialing and admissions testing programs. Our review of
this literature is stratified by three testing contexts: testing in grades K-12, licensure testing, and
certification testing.


Applicable Research From K-12 Contexts 


Although outside the primary areas of our review (admissions, licensure, and certification
testing programs), models for score report development derived from research in K-12
settings are relevant to these contexts. There are two prominent models for designing and
evaluating score reports: the Zapata-Rivera (2011) model and the Hambleton and Zenisky (2013)
model. Both were presented as part of work that focused on educational reporting but apply to
other contexts. The models offer a series of steps to guide score report development,
prioritizing thoughtful planning steps before any prototypes are created and encouraging an iterative
approach, that is, using information from later stages to revise and repeat earlier stages as
needed. In Table 6.2, we provide a brief summary of each model. The first two columns in Table 6.2
list the steps for score report development involved in each model. The third column lists sample
tasks that can be carried out to evaluate score report prototypes based on research by Clauser and
Rick (2016) following the Hambleton and Zenisky model.

In essence, the Zapata-Rivera (2011) model proceeds as follows: identify the information
needs of the intended audience for a score report (Phase 1); consider how those needs match the
information provided by the assessment (Phase 2); design/revise score report prototypes (Phase
3); and gather internal and external feedback on the prototypes (Phase 4). In turn, the
Hambleton and Zenisky (2013) model consists of the following steps: lay the groundwork for
developing reports (Phase 1); design prototypes (Phase 2); gather feedback, make revisions, and
seek additional feedback as necessary (Phase 3); and establish a process to evaluate whether
implemented reports continue to be used and interpreted as intended (Phase 4).

Besides the models, multiple methods and general suggestions from research on K-12 score
reporting have applications to reporting for credentialing and admissions programs. Some of
these models and methods were summarized by van den Heuvel, Zenisky, and Davis-Becker
(2014) and by Rick and Keller (2015). These reviews highlight the importance of recognizing
that the most important information to include in a score report and the best way to present
it vary widely depending on the intended audience(s), the purpose of the assessment, and the
psychometric properties of the data from which scores are derived. Thus, incorporating general
principles of good report design (e.g., Hullman, Rhodes, Rodriguez, & Shah, 2011; Jacoby, 1997;
Tufte, 1983, 1990; Wainer, 1997) is important, but is no substitute for collecting direct feedback.
For that reason, much of the research on score reporting has focused on methods for gathering
feedback from stakeholders, and the types of feedback that are needed. In the next section, we
review this research with respect to admissions and credentialing testing programs.


Score Reporting Research From Credentialing Contexts 


Several published studies and conference presentations from admissions and credentialing
contexts have described procedures for gathering feedback on operational or draft score reports.
Jones and Desbiens (2009), for example, used a four-question survey to investigate how well
53 residency applicants could interpret their United States Medical Licensing Exam (USMLE)
scores. The survey simply asked, “What was your score?” and “What percentile does this
represent?” in relation to two exams in the USMLE series. At the time, examinees received both
a three-digit and two-digit score, the latter resulting from the need to meet licensing
authorities' requirement that the passing score would always be 75. They found that 30 (57%) of the
residency applicants incorrectly perceived their two-digit score on the Part I exam as a
percentile, and 31 (58%) applicants made the same wrong assumption for scores on the Part II exam.
Thus, this quick and straightforward data collection approach provided evidence supporting
the researchers' hypothesis that some examinees misunderstood one of the two scores provided
on score reports (two-digit scores were discontinued in 2011).


Table 6.2 Summary of two score report development models and sample tasks.

Column 1: Zapata-Rivera (2011) model

  Phase 1: Gather assessment information needs
  Phase 2: Reconcile those needs with available assessment information
  Phase 3: Design/revise score report prototypes
  Phase 4: Evaluate report prototypes internally and externally

Column 2: Hambleton and Zenisky (2013) model

  Phase 1:
    a. Articulate score reporting considerations throughout test design decisions
    b. Identify intended audiences
    c. Complete needs assessment for each intended audience
    d. Review the literature and relevant documents
  Phase 2: Create draft reports
  Phase 3: Gather feedback on proposed reports (revise and repeat as necessary)
  Phase 4: Once reports become operational, evaluate stakeholder feedback in terms of
           accessing, interpreting, and using the reports

Column 3: Sample tasks from Clauser and Rick (2016)

  1a. Reviewed and expanded an existing internal document outlining intended inferences
      from score reports
  1b. Identified target users in the process of expanding the inferences document
  1c. Postponed until Phase 3
  1d. Conducted a literature review
  2.  Developed eight report prototypes. In collaboration with staff, selected three for the
      next phase.
  3.  Conducted a focus group with medical students (target audience) to collect feedback on
      the prototypes. Then, made revisions and gathered input on the revised prototypes
      through cognitive interviews. Lastly, sent a survey to staff to elicit additional input.
  4.  Not reached (Phase 3 efforts are still in progress)


In another effort related to USMLE score reports, Rick and Clauser (2016) conducted
cognitive interviews with 12 medical students as they interacted with three report prototypes. There
was interest in understanding what report features best supported adequate interpretations and
remediation plans, so all prototypes displayed the performance of an examinee who did not
pass. Participants received the prototypes in varying order and were asked to “think aloud”
while considering three guiding questions: “What do I see? What does this mean to me? What
can I do with this information?” (p. 6).

After analyzing the content of the interviews, Rick and Clauser (2016) concluded that when
examinees received subdomain feedback both in relation to the national average and in relation
to their own overall performance, they found it easier to interpret the former. There were 52
correct and two incorrect “compared to the national average” interpretations, while there were
43 correct and 11 incorrect “compared to your own overall performance” interpretations. Some
students aptly combined both types of feedback, but others had trouble understanding that
“overall performance” referred to their performance level across all subdomains. In terms of
remediation, there were seven times as many “adequate” plans (50 statements) as “inadequate”
plans (seven statements). Remediation plans were deemed adequate when students expressed
that they would spend more time on their weakest areas without ignoring other areas in which
their performance was also displayed as less than ideal. This type of plan is important because
the exam is designed to be integrative. Rick and Clauser noted that sentences explaining that
the exam is integrative and that students should review all subdomains prior to retesting, which
were included in all score report prototypes, likely contributed to the high number of adequate
remediation plans.

In the area of teacher certification, Klesch (2010) demonstrated the benefits of obtaining
input from multiple stakeholders and adjusting data collection methods according to the level
of detail needed at each point. First, she created three examinee score reports based on the K-12
literature. Then, she gathered feedback from 16 educators through individual meetings that
included an interview and a questionnaire with preference- and comprehension-based
questions. The meetings were conducted via an online video conferencing tool, which helped recruit
a geographically diverse sample.

The interviews and questionnaires revealed several trends about how teachers interpret score
reports. For example, several teachers suggested eliminating abbreviations and pure statistical
terms; one teacher mentioned that even the use of “N” to represent “number” could be
confusing. In addition, teachers had a strong preference for seeing raw scores when possible, and
some even attempted to compute percent-correct scores from scaled scores to understand them
better, which leads to a false result in most cases. After this stage of data collection, Klesch
(2010) gathered feedback from six educational testing professionals through focus groups.
Upon completion of the study, she offered a number of conclusions about teachers' preferences
and information needs, including “confidence intervals were not immediately understood or
seen as useful, while the performance of passing examinees provided an important contextual
framework; [and] scaled scores need more explanation in how they are related to and derived
from raw scores” (p. 138).
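
The following minimal sketch illustrates why computing a percent-correct score directly from a scaled score generally gives a false result. It assumes a purely linear raw-to-scale conversion onto a hypothetical 200 to 800 scale; operational programs typically use equating and may use nonlinear conversions, so the function names and numbers here are illustrative assumptions only.

```python
def raw_to_scaled(raw, n_items, scale_min=200, scale_max=800):
    """Linear conversion of a raw score to a hypothetical reporting scale."""
    return scale_min + (scale_max - scale_min) * raw / n_items


def naive_percent_correct(scaled, scale_max=800):
    """The mistaken shortcut: treating scaled / scale_max as percent correct."""
    return 100.0 * scaled / scale_max


# Hypothetical examinee: 60 of 100 items answered correctly.
raw, n_items = 60, 100
scaled = raw_to_scaled(raw, n_items)
true_percent = 100.0 * raw / n_items

print("scaled score:", scaled)
print("true percent correct:", true_percent)
print("naive percent correct:", naive_percent_correct(scaled))
```

Because the hypothetical scale does not start at zero, the shortcut overstates the percentage of items answered correctly. This is one reason a report might explain, in plain language, how scaled scores relate to raw scores.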
Score Report Research in Admissions Testing Programs


In addition to providing subscores, one approach for providing diagnostic information to
end-users of test results involves using “item mapping” to enhance performance category
descriptions (PCDs). Hambleton and Sireci (2008) led an effort to develop “clear, meaningful,
and instructionally relevant” (p. 3) PCDs for the SAT using item mapping and assistance from
content experts. Their process involved first calibrating previously administered SAT items
from seven forms onto a common IRT scale. Then, equipercentile equating was used to find
the IRT score intervals corresponding to six intervals on the SAT scale (e.g., 200 to 290, 300 to
390, and so on), which would be the focus of the PCDs. Next, content experts who were familiar
with the SAT program were recruited to collaboratively develop PCDs using the item mapping
information. Booklets were prepared in which experts could see which items test takers in a
given interval were likely to answer correctly (based on at least a 65% probability of a correct
answer), how response probabilities differed across intervals, and other relevant information.
In a series of two to three meetings, panels drafted and finalized PCDs, which were reviewed
by consultants and staff and sent back to the content experts for a last round of feedback. This
process was conducted separately for mathematics, critical reading, and writing.
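
The 65% response probability criterion can be illustrated with a short sketch. It assumes a two-parameter logistic (2PL) IRT model, which is an assumption made here for illustration; the text does not state which IRT model was used for the SAT calibration. The item names and parameters are hypothetical.

```python
import math

RP = 0.65  # response probability criterion described in the text


def p_correct(theta, a, b):
    """Probability of a correct answer under an assumed 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def mapped_location(a, b, rp=RP):
    """Ability level at which the probability of a correct answer equals rp."""
    return b + math.log(rp / (1.0 - rp)) / a


def items_likely_correct(items, theta, rp=RP):
    """Items that a test taker at ability theta answers correctly with probability >= rp."""
    return [name for name, (a, b) in items.items() if p_correct(theta, a, b) >= rp]


# Hypothetical item parameters: name -> (discrimination a, difficulty b).
items = {"item_1": (1.2, -1.0), "item_2": (0.8, 0.0), "item_3": (1.5, 1.0)}

for name, (a, b) in items.items():
    print(name, "maps to theta =", round(mapped_location(a, b), 2))

# Items likely to be answered correctly at the lower bound of a hypothetical score interval.
print(items_likely_correct(items, theta=0.5))
```

Listing, for each score interval, the items that meet the criterion gives content experts the kind of evidence described above for writing PCDs. The choice of 65% (rather than, say, 50% or 80%) changes where each item maps, so it is a judgment that should be documented.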

According to Hambleton and Sireci (2008), nearly every one of the 20 content experts who
participated in the process provided comments during the final round of feedback, but all
changes suggested were editorial in nature, reflecting widespread consensus over the substance
of the PCDs. This result suggests that clearly communicating the goals of item mapping and
PCD development, and providing experts with several avenues to offer input into the process,
are helpful steps in developing effective PCDs. For assessments that are largely unidimensional,
carefully developed PCDs can provide valuable diagnostic information to improve stakeholders'
understanding of their performance and, if necessary, inform remediation plans.

Finally, Powers, Li, Suh, and Harris (2016) described efforts to improve ACT score reports
through the addition of “reporting categories.” Starting in late 2016, scores on reporting
categories such as functions, algebra, and geometry replaced ACT subscores. According to Powers
et al., the advantage of reporting categories is that they are more closely aligned with college
and career readiness standards and are provided along with readiness benchmarks that help
students better prepare themselves for college.


Research on Score Report Quality 


In terms of enhancing score reports, there is a considerable body of research on the
psychometric quality of subscores across a number of credentialing and admissions testing programs
(Haladyna & Kramer, 2004; Lyren, 2009; Puhan, Sinharay, Haberman, & Larkin, 2008;
Wedman & Lyren, 2015). A full review of those studies is beyond the scope of this chapter, so we
focus on recommendations for communicating, rather than computing or evaluating,
subscores (readers seeking a more in-depth discussion of subscores are referred to Sinharay, Puhan,
Haberman, & Hambleton, this volume).

One important study in this area was conducted by Luecht (2003), who compared four
methods of computing subscores for credentialing tests and discussed points to consider when
deciding to report them. To ensure that subscores are interpreted and used as intended, he
emphasized the need to match numerical information with the appropriate display. Luecht
provided several recommendations based on a review of long-established resources about creating
good graphics, including: “show the data or legitimate patterns that represent the data,” “avoid
distortions,” “encourage visual comparisons,” and “make sure that the graphic(s) is/are closely
integrated with statistical and verbal descriptions of results” (pp. 18-19). He concluded that
diagnostic feedback should help candidates understand their strengths and weaknesses in an
unambiguous way, and that how well subscores and related graphs are understood by their intended
audience should be tested empirically rather than assumed.

One of the most prevalent challenges associated with reporting subscores is
communicating their precision. Omitting information about measurement uncertainty may be misleading
(American Educational Research Association [AERA], American Psychological Association, &
National Council on Measurement in Education, 2014), but presenting too many technical
details may be equally problematic. Phelps, Zenisky, Hambleton, and Sireci (2012) offered
several examples of how certification and licensure assessment programs report reliability and
measurement uncertainty. Inspecting nine score report samples and ancillary documents from
accounting, law, medicine, nursing, and teaching programs revealed a mix of approaches. Two
organizations provided diagnostic feedback in the form of subscores with confidence bands,
one organization included text about possible sources of measurement error, and two
organizations did not present information about precision, but used computerized-adaptive testing
algorithms to ensure that pass/fail decisions were based on a pre-specified level of precision.
Many organizations only provided information about score and subscore imprecision in
published papers, invited presentations, and technical documents, some of which were not directly
available to the public.

Considering score reports along with other supporting materials, Phelps et al. (2012) found
that licensure programs tended to provide less information about precision than educational
testing programs, and smaller programs provided fewer details than larger programs. Ferrara
and Lai (2016) had similar observations and noted that larger programs are likely better able to
provide information about precision and other technical aspects due to higher testing volumes
as well as potentially more resources to support scoring procedures and report development.
Ferrara and Lai also added that licensure programs tend to offer more technical information to
test takers than certification programs, and that might be due to the usually higher stakes
associated with obtaining a license versus a certificate.


Validity Issues in Score Reporting 


In the world of testing, validity refers to “the degree to which evidence and theory support the
interpretations of test scores for proposed uses of tests” (AERA et al., 2014, p. 11). This
definition, from the Standards for Educational and Psychological Testing (hereafter referred to as
the Standards), makes it clear that validity does not refer to an inherent property of a test, but
rather to how test scores are used and interpreted (see Tannenbaum, this volume, for other related
definitions).

The interpretation of a test score begins with a person viewing a score report. Thus, the design
and dissemination of score reports directly affect the degree to which a test has its intended
effects. For this reason, the AERA et al. (2014) Standards mention the importance of properly
reporting test results in several chapters.

In the chapter on “Test administration, scoring, reporting and interpretation,” the Standards
point out, “Reports and feedback should be designed to support valid interpretations and use,
and minimize potential negative consequences” (AERA et al., p. 119). This recommendation
sums up the guidance provided by the models for score report development (e.g., Hambleton &
Zenisky, 2013; Zapata-Rivera, 2011), and seems to be adhered to by the admissions and
credentialing score reports we reviewed.

The Standards also point out the importance of providing explanations of score reports to
prevent misinterpretations. As they suggest, “Interpretive material should be provided that is
readily understandable to those receiving the report” (AERA et al., 2014, p. 112). This standard
suggests that sufficient supporting material should be provided so that the score reports are
easily comprehensible to those who receive them. In some cases, for example when admissions
test scores are reported to parents, translations of the interpretive information may be necessary
(e.g., College Board, 2017).

The Standards also provide suggestions for conducting research to help develop explanatory
material to accompany score reports. As they put it,


While test users are primarily responsible for avoiding misinterpretation and misuse, the
interpretive materials prepared by the test developer or publisher may address common
misuses or misinterpretations. To accomplish this, developers of reports and interpretive
materials may conduct research to help verify that reports and materials can be
interpreted as intended (e.g., focus groups with representative end-users of the reports).


(p. 119) 


Based on our review of the literature, many testing programs are also adhering to this
guideline, as the development of their score reports and interpretive information has been informed
by research, much of which involved gathering perceptual and preference data from their
stakeholders (e.g., Jones & Desbiens, 2009; Klesch, 2010; Rick & Clauser, 2016).

Other validity issues discussed in the AERA et al. (2014) Standards with respect to score
reports are ensuring that score reports are corrected whenever errors are found, and ensuring
privacy and confidentiality in score reporting. The Standards also point out that when
composite scores are formed from different components of a test, it should be clear to end-users how
the composite was developed. For example, they state, “If tests will be combined into a composite,
candidates should be provided information about the relative weighting of the tests” (p. 182).

With respect to score reporting in credentialing testing, the Standards explicitly encourage
reporting information to candidates who do not pass the exam, but they also point out that the
psychometric property of any “diagnostic” scores should be established. For example, in the
“Workplace and Credentialing” chapter, the AERA et al. (2014) Standards state,


Candidates who fail may profit from information about the areas in which their
performance was especially weak. This is the reason that subscores are sometimes provided.
Subscores are often based on relatively small numbers of items and can be much less
reliable than the total score. Moreover, differences in subscores may simply reflect
measurement error. For these reasons, the decision to provide subscores to candidates should
be made carefully, and information should be provided to facilitate proper interpretation.

(p. 176) 


Thus, validity issues in reporting scores for credentialing exams are not limited to the
reporting of the pass/fail distinction. Like other score reports, the validity of all the information
provided should be supported by both theory and evidence. For example, the theory that dictated
the definition of the construct (e.g., unidimensional or multidimensional) should be consistent
with the scores that are reported, and evidence that the reported scores have sufficient
reliability for their intended purpose should be provided. Of course, the amount of evidence needed
for a given interpretation is related to the stakes associated with the use of the test score. Thus,
reliability expectations for the pass/fail score will be higher than reliability expectations for
subscores reported for diagnostic purposes. Nevertheless, all reported scores need to have sufficient
evidence that they provide useful information and are understood by end-users.

The AERA et al. (2014) Standards describe five sources of validity evidence “that might be
used in evaluating the validity of a proposed interpretation of test scores for a particular use”
(p. 13). A description of these five sources is beyond the scope of the present chapter and so
readers are referred to the Standards and to other descriptions in the validity literature (e.g.,
Sireci & Gandara, 2016; Sireci & Soto, 2016; Tannenbaum, this volume). However, it should be
noted that all five sources of evidence are relevant to the evaluation of score reports.

Recently, O'Leary, Hattie, and Griffin (2017) argued that a sixth source of validity evidence
should be added to the list: evidence of appropriate interpretability of test scores. Specifically,
they argued that “alignment between intended interpretations and use of scores and actual
interpretations and use of scores is critical” (p. 16). O'Leary et al. recommend that test
validation should include evaluating the degree to which end-users of tests correctly interpret test
results. They pointed out such research has been lacking in validity arguments developed for
testing programs, and they lamented, “It is almost absurd to think that the intended
interpretations and uses of test scores might fail because there is a lack of alignment with the actual
interpretations made and uses enacted by the audience” (p. 16). We revisit this perspective in
the final section of this chapter.

Lastly, it is important to note that in addition to AERA et al. (2014), the National
Commission for Certifying Agencies (NCCA, 2014) and the International Test Commission (ITC, 2014)
also provide score reporting guidelines that apply to admissions and/or credentialing testing
programs. The three organizations hold similar views of the responsibilities of testing programs
towards their intended audiences, and their recommendations complement rather than conflict
with each other. Davis-Becker and Kelley (2015) provide an excellent summary of the key ideas
found across guidelines from AERA et al., NCCA, and ITC, as well as suggestions for how
credentialing programs can meet those guidelines.


Looking Forward: Future Research and Practices in Score Reporting


In this chapter, we discussed research and practices in score reporting for admissions and
credentialing exams, and we contrasted these practices with professional standards for testing. In
general, the score reporting practices we reviewed were consistent with the AERA et al. (2014)
Standards and with other guidelines for best practices in this area (e.g., Hambleton & Zenisky,
2013; NCCA, 2014; Zapata-Rivera, 2011). However, tests in these areas are under increasing
scrutiny and also are experiencing significant growth. Thus, we expect score reporting to receive
more attention, and to become more interactive, including links to extensive interpretive
material available online (e.g., American Board of Internal Medicine, 2015a; USMLE, 2017).

Perhaps the most important standard in the AERA et al. (2014) Standards is, “A rationale
should be presented for each intended interpretation of test scores for a given use, together with
a summary of the evidence and theory bearing on the intended interpretation” (p. 23). Given
that score reports are the seeds from which test score interpretations grow, in the future, we
would like to see research-based evidence to support the reliability and utility of all information
included on a score report.

We also predict the perspective of O'Leary et al. (2017) is likely to gain traction, and research
on score reporting will be more commonly conducted as part of developing a validity argument
to support the use of a test for a particular purpose. They claimed that “Broadening validity
evidence to incorporate a notion of evidence of interpretability could be achieved quite simply
by including evidence of score report interpretability as one of the forms of validity evidence”
(p. 20). We are not sure that a new category of validity evidence is needed, because research on
test score interpretability could be couched within the source of validity evidence known as
validity evidence based on testing consequences (AERA et al., 2014). Nevertheless, regardless
of how such evidence is categorized, we agree it is essential, and we hope to see more of it in the
near future.

In summary, score reports for admissions and credentialing programs appear to have
stakeholders' needs in mind and strive to provide the information they need in a comprehensible
format. However, reporting the results from these assessments is complex, and so test
developers and testing agencies continue to improve their reports based on research to facilitate proper
score interpretation, and minimize misinterpretations. The addition of URLs to score reports
and accompanying interpretive material that indicate how to get additional information is an
important trend that is likely to facilitate proper interpretation of test results. We hope future
research in this area will confirm that hypothesis.


Score Reports for Large-scale Testing Programs


Managing the Design Process 


Sharon Slater, Samuel A. Livingston, and Marc Silver 


People pay to take a test (or pay to create a test and require other people to take it) because they
want or need the test results for some purpose. The information on the score report is the product
they are paying for, not the test itself. The most carefully developed, research-tested procedures for
assessment design, item development, and psychometric analysis will be wasted if the score report
does not communicate the test results in a way that encourages proper interpretation and use.

This chapter is written for people whose responsibilities include the development of score
reports for a large-scale testing program. Typically, in this design process, the people who design
the report are not the ones who will make the final decision as to whether the design is approved.
We refer to the person or group who will make that decision as the “client.” In some cases, the
client will be a person or group of staff members at the same organization as the designers of the
report, usually a testing company or agency. In other cases, the client will be a member of another
organization that has contracted with the testing company for development work that includes the
design of the report. Our purpose in writing this chapter is to provide the reader with the benefits
of our experience in the design process. We think this information will be useful to both score
report designers and to clients, and ultimately to the consumers of score reports.

In the past decade, there have been several publications containing recommendations for
score report design (Hambleton & Zenisky, 2013; Hullman, Rhodes, Rodriguez, & Shah, 2011;
Tannenbaum, this volume; Zapata-Rivera, 2011; Zapata-Rivera & Katz, 2014; Zapata-Rivera,
VanWinkle, & Zwick, 2012; Zenisky & Hambleton, 2016). However, these published
recommendations do not always translate smoothly to practice. Score report designers find their options
limited by the available funding, technology, and display space. When clients' wishes conflict
with what the report designers recommend, the process of designing score reports becomes
even more complicated. In the following pages, we discuss several factors to consider during
the score report design process, including: communicating with the client and negotiating what
score information will be included on the report, deciding what additional descriptive
information to include on the report, and deciding how to present that information.

We guide readers step-by-step through a score report design process that we have found to
be successful in K-12, higher education, and business settings. This design process relies heavily
on principles of graphic design and user experience, and incorporates recommendations based
on the score reporting literature.


Form a Team of Experts 


Before beginning the score report design process, it is important to have the right people in
place to do the work. We recommend assembling a team of people whose purpose, as a group,
is to design and evaluate score reports. The mission of the group should be to develop score
reports that are technically accurate and easily understood by the intended user. The members
of this team should have different kinds of expertise, to bring different perspectives to the
process. Some kinds of expertise are required at one stage of the process; other kinds of expertise
are required at other stages:


• Graphic designers know how to create visually pleasing arrangements of report elements.

• User experience practitioners understand how design and wording decisions can add
to or detract from the usability of the report, and they can recommend improvements.
The term user experience, often referred to as UX, seems to have many definitions, and
often refers to computer systems or web design. However, in our context, user experience
experts keep in mind all aspects of the end-user's perception of the score report when
considering the design. These aspects include how users will interact with the report, how
well the information in the report is communicated, how the look and feel of the report is
perceived, and how easy the report is to use and understand.

• Cognitive science researchers know what research studies have shown about
communicating information so the audience can easily understand the intended message. (See the
chapter in this volume by Mary Hegarty on how findings from Cognitive Science and
Information Visualization can inform score reporting.)

• Psychometricians understand the limitations of each type of score and can recommend
scores that will be adequately supported by the data.

• Assessment developers can identify the abilities that the test measures and describe them
in language that the intended audience is likely to understand.

• Information technology (IT) staff can determine what content and format are technically
feasible for the production of the report (online or printed).

• Accessibility experts ensure that the score report can be easily and correctly interpreted
by those with visual impairments and/or those using assistive technologies, such as screen
readers.


Depending on the resources available, it may not be possible to assemble a design team
with all the necessary types of expertise. Sometimes a team member can fill more than one
of these roles; for example, the graphic designer can also be the user experience expert if
he or she has the necessary knowledge, or the IT expert may be sufficiently well-versed in
accessibility features to fill that role. In the case where professionals with the types of
expertise listed above are not available within an organization, we recommend hiring consultants
to provide the missing skills. With the various perspectives represented, it is important for
team members to respect each other's expertise. They must each realize that what appears
best from their point of view could be impractical for some reason they hadn't considered.
For example, a text change recommended to improve the usability of a report may cause an
unintended change to the interpretation of the scores, prompting an objection from the
psychometrician or assessment developer. The team must work together to create solutions that
are acceptable to all.

In this chapter, we describe the process that we use for designing score reports for an external
client. The procedure is somewhat simpler if the client is part of the same organization as the
people actually designing the report. In that case, the design team will know more about the
decision makers and the factors that will influence their approval of the design. For the
remainder of the chapter we will refer to three separate entities with involvement in the score report
design process:


• The design team is the group described above.

• The client is the person or group that the organization paying for the test designates as
responsible for the score report being designed. This person or group represents the
audiences' interests and is often staff from a state department of education or a credentialing
agency.

• The program team consists of staff from the testing agency, including the program
manager, who is the person responsible for communicating with the client and keeping the
project on schedule. In the case where there is no external client, the program team is the
client, as well.


Clear, ongoing communication among these three groups is essential. From beginning to
end, it is important for the design team to work as directly as possible with both the program
team and the client staff. Both the design team and the program team need to understand what
information the client would like to include on the score report. That information will nearly
always include some kind of overall test score. It may also include classification levels,
subscores, graphs, photographs or illustrations, and written text intended to help score users to
understand the results or to help test takers interpret their scores or improve their performance.
Our experience has taught us that it is important for the design team (or at the very least, the
graphic designer) to get feedback about the score report designs directly from the client. It is
also wise to find out which individual will have to approve the final design and to involve that
person in the design process as early as possible. If not, the design team may not correctly
understand the decision makers' wishes, leading to wasted design effort, valuable time lost, and
frustration for all involved.

Communication works a little differently with every testing program. One strategy that
works particularly well is to have a one-day or two-day working session completely devoted to
designing the report. The participants are the graphic designer, a psychometrician, and an IT
staff member from the design team, one or two members from the program team, and key staff
on the client side. At this score report design retreat, the participants work together,
brainstorming and sketching ideas, with the designer modifying the draft score report designs (mockups)
in real time as changes are suggested. Having such a work session makes it possible to make
significant progress quickly. However, travel costs and scheduling constraints often make this
type of a meeting impractical. More typically, we hold regularly scheduled conference calls as
part of the design process, as often as weekly. In one case, our main communication with the
client was only indirectly through the program manager by way of one-line emails sent from a
smartphone. The result was a report design that the client considered unacceptable, followed by
substantial rework and increased costs. Clear and documented communication with the client
is important in order to avoid such an outcome.


Score Report Design Process: Step by Step 


Ideally, score report design begins at the very beginning of the test development process. An
early step in the process should be the creation of a prospective score report showing what
information the test is intended to provide. The test can then be designed to provide that
information. This is good advice, whether the test development process is following
evidence-centered design principles (Tannenbaum, this volume; Zapata-Rivera et al., 2012; Zieky, 2014) or
not. However, this ideal situation often is not what occurs in practice. More commonly, the
test development process is well underway before the test designers begin thinking about the
score report. Often, we find ourselves developing a score report for a test that is already fully
developed, or nearly so. Regardless of when the process of designing score reports begins, the
following step-by-step procedure can be applied. Figure 7.1 is a graphic representation of the
score report design process, which is described in detail below.


Design Process: Step 1 
Gather Information About the Test and the Scores to Be Reported


The process begins with the design team asking the program team for information about the
test and how it will be used. This “kick-off” meeting is an opportunity for the program team to
provide background and context about the client's needs before the design team and client meet
directly. Our design team has developed a questionnaire to help guide this initial discussion. It
begins with open-ended questions intended to help us learn about the test for which the score
report will be designed:


• What kinds of knowledge or skills does the test measure?

• Who will be the main users of the score report?

• What decisions will be made with the information from this score report?

• Is there anything unusual about the testing program that may create difficulties for score
report users?

• What are the critical deadlines for designing the score report? (State board meetings, IT
programming schedules, score reporting deadlines?)


Once the design team understands these general score reporting needs, they ask more
specific questions like those listed below. The design team should have as much of this
information as possible for each type of score report to be designed. The program team may not have
answers to all of these questions, but they can usually provide enough information about the
client's wishes to get the design work started.

Figure 7.1 Iterative score report design process.


1. Is this a new score report or a revision?
2. Is there an existing score report? If not, are there any sketches for suggested report formats?
3. Who are the test takers?
4. Who is the primary user of this score report?
5. Will there be other users of this score report?
6. What will the primary user want to know first? What is most important to the user of this
score report?
7. Will what scores be included? What scores will this report include?
8. Will any reliability statistics (e.g., standard error of measurement) be included on this
report? If so, what statistics, for which scores?
9. Will this score report include comparative data, such as percentiles, group averages, or
previous year scores?
10. Will this score report include any other kinds of information (proficiency levels, growth, etc.)?
11. Will this score report be produced in-house or by an outside vendor?
12. Will this score report be delivered on paper? Electronically via email? Online?
• If the report will be produced on paper, is the report limited to a number of pages/sides?
If so, how many?
• If the report will be online, what size screen is the reader assumed to have (e.g., desktop,
tablet, smartphone)?
• If the report is online, is conditional text required/desired (e.g., “If your score is between
X and Y that means...”)?
• If the report is online, will the reader have the option to choose which information is
displayed?
• If the report is online, will the reader be able to choose among different report formats?
13. Are there certain colors that must be used? Any specific branding or logos?
14. Is a sample score report required for a website, brochure, marketing materials or
interpretive guide?
15. Are any needs assessments, usability studies, or focus group sessions planned?
16. Is there any other important information to consider in designing this score report?


Using the questionnaire to focus the discussion, the design team typically can gather the
information needed to begin the design process by talking with the program team for an hour
or two. After that information-gathering session, the graphic designers will have the
information they need to begin their work of creating preliminary mockups of the score report. For
any questions the program team cannot answer, the design team can offer suggestions or make
recommendations. They can offer to create different versions of the score report to show how
various options would look (e.g., versions with and without percentiles or group averages,
versions with and without standard errors shown graphically).


Design Process: Step 2 
Create a Schedule for the Score Report Design 


In Step 1 the design team gathers information about critical deadlines for the score report
design. These dates will drive the creation of the score report design schedule in Step 2. A score
report design that must be completed in two months will have a very different schedule from a
score report design that must be completed in six months. In scheduling, it is generally a good
idea to start with the targeted end date for the design process and work backwards. The end of
the design process is usually the date at which the IT staff needs to begin coding the systems
for production of the score report (to be distributed electronically, printed on paper, or both).
Although there does not seem to be such a thing as a “typical” score report project, Table 7.1
shows an example of a high-level schedule for the score report design process.

If time permits, we begin by scheduling two weeks or so for the designer to create preliminary
concepts, followed by three rounds of mockups that will be shared with the program team and the
client in an iterative process of review and revision. Additional iterations of the review-and-revision
sequence can be added as needed. More complex reports usually require more iterations, but the
time available will place a limit on the number of iterations possible. We have finalized designs in
as few as three iterations, or “rounds,” but some projects have involved over 20 rounds for a single
report design. The number of rounds will depend somewhat on the complexity of the score report,
but it will also depend heavily on the number of decision makers involved in the process and on how
effectively the program team can manage the schedule and keep the work on track. The schedule
in Table 7.1 allows about three months (counting business days only) for design of a score report,
but we have seen projects with shorter timelines and projects that took much longer to complete.
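
The working-backwards arithmetic can be sketched in a few lines. The example below is only an
illustration: the durations are the business-day values listed in Table 7.1, while the handoff
date and the helper name `subtract_business_days` are hypothetical, and holidays are ignored.

```python
from datetime import date, timedelta

# Activity durations in business days, in the order listed in Table 7.1.
DURATIONS = [1, 10, 2, 4, 2, 4, 5, 1, 4, 2, 4, 5, 1, 4, 5, 1]


def subtract_business_days(end: date, days: int) -> date:
    """Step back `days` business days (Monday-Friday) from `end`.

    Holidays are not considered in this sketch.
    """
    current = end
    remaining = days
    while remaining > 0:
        current -= timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


total_days = sum(DURATIONS)

# Hypothetical date on which IT staff must begin coding the report.
it_handoff = date(2025, 6, 2)
latest_start = subtract_business_days(it_handoff, total_days)

print(f"Total business days: {total_days}")
print(f"Latest kick-off date: {latest_start.isoformat()}")
```

Adding a further round of mockups amounts to appending its activity durations to the list, which
makes the effect of each extra iteration on the kick-off date easy to see.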


Design Process: Step 3 
Begin Creating Graphic Designs 


One thing that can be helpful at the outset is to get a “napkin sketch” (a very rough drawing of
the score report) from the client. Creating the napkin sketch forces the client to think about
some of the goals and issues that the design team will face. The graphic designer then uses
the napkin sketch and/or the answers to the questions asked in the kick-off meeting to create
several sample score report concepts, usually using a professional drawing program such as
Adobe InDesign® or Adobe Illustrator®. These concepts are not mockups of the full report. They
are examples of the various options for each section of the score report. This is the time for the
graphic designer to try out different options for types of graphs, for the size, placement, and
arrangement of text and numerical information, for the use of color and icons, and for page
composition. The choice of options to try will be based on the design team's understanding of
client preferences and on previous experience with similar testing programs. At this stage in the
process, the goal is to produce options for mockups that the client can then react to directly.


Table 7.1 Sample schedule for design of one score report with three rounds of mockups.


Score Report Design Schedule Activities                                         Duration
Kick-off meeting to gather information about the score report                   1 day
Designer creates multiple design concepts for the score report                  10 days
Design team reviews concepts and provides feedback to designer                  2 days
Designer creates Round 1 mockups                                                4 days
Design team reviews Round 1 mockups and provides feedback to designer           2 days
Designer revises Round 1 mockups based on design team comments                  4 days
Program team reviews Round 1 mockups                                            5 days
Design team and program team meet to discuss feedback on Round 1                1 day
Designer revises based on feedback to create Round 2 mockups                    4 days
Design team reviews Round 2 mockups and provides feedback to designer           2 days
Designer revises Round 2 mockups based on design team comments                  4 days
Program team and the client review Round 2 mockups                              5 days
Design team, program team, and the client meet to discuss feedback on Round 2   1 day
Designer revises based on feedback to create Round 3 mockups                    4 days
Program team and the client review Round 3 mockups                              5 days
Design team, program team, and the client meet to discuss feedback on Round 3   1 day
Finalize score report design or continue additional rounds until complete
Total days                                                                      55 days

Existing reports can provide both good and bad examples of graphics, color, layout, and
so on. In a K-12 assessment report to parents, it is important to use familiar graphics,
attention-getting color, and brief, simple text. For score reports that will be used by institutions, it
may be acceptable to use more complicated language and include more statistical information.
If possible, the choice of language used and technical information included should be informed
by research that has determined specific audience needs, pre-existing knowledge about, and
attitudes toward, assessments (e.g., Kannan, Zapata-Rivera, & Leibowitz, in press; Underwood,
Zapata-Rivera, & VanWinkle, 2007; Zapata-Rivera et al., 2012). However, research-based
guidance in designing score reports for various situations is not always available.

To help the design team keep the needs of the intended audience in mind, sometimes the program
team or the client will provide market research. Such studies can describe what a particular
audience wants or needs to know from the score report. At other times, the design team must rely
on the client’s judgments or intuitions about what the users of the score report will want to know.
Throughout the design process, our design team uses the principles listed below, which are
consistent with guidelines in the score reporting literature (Hattie, 2009; Kannan et al., in press;
Underwood et al., 2007; Zapata-Rivera, 2011; Zapata-Rivera et al., 2012; Zenisky & Hambleton, 2016). In
addition to guiding design efforts for new score reports, these principles can be applied to evaluate
existing score reports. See Figures 7.2a and 7.2b for examples of two ways the same score report
information can be displayed. Figure 7.2b was designed following the principles below.

Figure 7.2a Example of a score report that does not follow the design principles listed in Step 3.


98 • Sharon Slater, Samuel A. Livingston, and Marc Silver


[Figure 7.2b content: READING report. Score Level: PROFICIENT. Student score: 750, shown on a
scale with the bands Basic (100-709), Proficient (710-889), and Advanced (890-1000).]

SUBSCORES                ITEMS CORRECT   ITEMS TESTED   PERCENT CORRECT
Grammar                   7              10             70%
Vocabulary               18              21             86%
Reading Comprehension    10              19             53%
TOTAL                    35              50


Figure 7.2b Example of a score report based on the design principles listed in Step 3.


Principles to guide the score report process include: 


• Emphasize the most important information in the score report. Too many score reports
emphasize unimportant features (logos, illustrations, etc.). The most important parts of
the report should command the most attention.

• Design the report so that viewers can see and understand the most important information
in 10 seconds or less. What is the first question that someone looking at the score
report will want the answer to? It may be “What was my score?” or “Did I pass?” or “How
did I perform in comparison to other people taking the test?”

• Create a strong visual hierarchy that guides the viewer’s eye appropriately through the
report. Items of information that viewers will need to compare should be close together
in the report.

• Eliminate visual clutter: anything printed on the report that does not convey useful
information. Avoid repeating elements of the report, except where repeating them makes
the report clearer and easier to read. Each element on the score report should earn its
space.

• Avoid using lines that add to visual complexity. Instead, use shaded areas to delineate
space without adding clutter.

• Use visual embellishments such as icons only when they make it easier for users to
correctly interpret the report. Be especially careful to avoid any visual elements that may
have a negative connotation to some viewers of the report. (For example, a thumbs-up
symbol is offensive in some parts of the world.)

• Use colors to help convey information. Colors, used meaningfully, can make the report
easier to interpret. However, make sure that the report can convey all its information even
if the viewer cannot accurately differentiate colors or if the report is printed or
photocopied in black and white.

• Follow Web Content Accessibility Guidelines (WCAG; Caldwell, Cooper, Reid, &
Vanderheiden, 2008) for accessibility. Make sure there is adequate contrast between background
and text, and that viewers using assistive technologies can easily read and interpret the
report.

• If the score report will be printed, take into account the accuracy of the software and
printers that will be used for production. For example, if the placement of printed
elements (e.g., text, symbols) can vary by 1 millimeter, the report should not include
elements that require more precision than that.

• Make sure that the report design can accommodate unusual but possible conditions, such
as very long names or very low or high scores. (We sometimes design the sample score
reports for a student named “Verylongfirstname Extremelylonglastname” to make sure to
leave enough room.)

• Make sure the report can be economically reproduced. Avoid designs that will require
large amounts of paper, toner, or ink, especially if schools or families will be printing the
reports. Reduce the number of pages where possible.

• Ensure that any language in the report is appropriate to its intended audience, at the
proper reading level. Avoid technical language that might be difficult for non-experts to
understand, particularly if the score users may have limited English language skills.
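
The contrast principle above can be checked numerically. The following is a minimal sketch, not part of the authors' process: it computes the contrast ratio between a text color and a background color using the relative-luminance definition from WCAG 2.0 and compares it with the Level AA thresholds (4.5:1 for normal text, 3:1 for large text). The function names and the `#RRGGBB` input format are assumptions made for illustration.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one 8-bit sRGB channel to linear light, per the WCAG 2.0 definition."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a color written as '#RRGGBB' (hypothetical input format)."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05)."""
    lum_a = relative_luminance(foreground)
    lum_b = relative_luminance(background)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes_aa(foreground: str, background: str, large_text: bool = False) -> bool:
    """True if the color pair meets the WCAG 2.0 Level AA minimum contrast."""
    threshold = 3.0 if large_text else 4.5
    return contrast_ratio(foreground, background) >= threshold
```

Black text on a white background gives the maximum ratio of 21:1, while a light gray such as `#CCCCCC` on white falls well below the 4.5:1 minimum. A check like this covers only color contrast; it does not replace testing with assistive technologies.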


At this stage, it is important to make sure that IT reviews the designs prior to showing them
to clients. This step will ensure that the client does not fall in love with something that is not
technically feasible.


Design Process: Step 4 
Get the Client’s Reactions to the Initial Designs


Once score report mockups have been created and the program team has had a chance to provide
feedback, we share the mockups with the client to get feedback. Typically we present three
different versions of each report. Often the client will like some features of one version and other
features of another version. Sometimes after seeing the score report elements on the page for the
first time, the client may have new ideas for ways to present a certain piece of information. We
try to find a way for the designer to hear feedback from the client first-hand, with the
opportunity to ask questions and get a better understanding of what changes the client wants, and why.
A few minutes of dialog between the designer and the client can save hours of unnecessary work
for the designer and days in the production schedule.


Design Process: Step 5 
Gather Feedback from Intended Users of the Score Report 


Once the client and the program team are happy with the score report mockups (which can
take a number of rounds of design), it is wise to gather input from the people who will actually
use the score report. In our experience, clients often prefer to gather score user feedback using
actual mockups that have been designed and reviewed by them first (in Steps 1-4). It is unusual
for a testing program to have the resources to gather feedback from score report users more
than once. Typically, score reports are designed based on the client’s understanding of what
their score users will want and need, and those mockups are shown to score users in a focus
group setting or usability study in which their feedback can be collected and applied to later
revisions to the score report design.

In the user experience world, this type of audience review is recommended for very early in
the design process, which is also consistent with the score reporting literature. Zapata-Rivera
and Katz (2014) suggest performing an “audience analysis” prior to designing score reports.
Zenisky and Hambleton (2016) also recommend conducting a needs assessment at the beginning
of the score report design process and evaluating stakeholders’ interpretation and use of
score reports at every step of an iterative multistep report development process. While we agree
with these recommendations and would like to see early and frequent input from score report
users, this is not what we typically see in practice.

Whether early or later in the design process, the best way to know what the intended users
want and need in a score report is to ask them directly. In this step, you can learn what score
report recipients want to know and how well they understand the information on your score
report. It is important to find out what they like and dislike, and even more important to find
out whether they are interpreting the results correctly. Questions like “What was the student’s
overall score?” or “What does the percentile mean?” give us a good idea about whether the score
report design is clearly communicating the information and whether any parts of the score
report are confusing to users. Comprehension questions like these are uncomfortable to ask
and can make users feel as if they are being tested on something they have not studied. Before
we ask any comprehension questions, we emphasize that our purpose is to find out if the report
is clear and understandable. We tell the users that if they have trouble answering any of these
questions, we will know that we need to revise the report. We are seeing more and more studies
that include comprehension questions in their assessment of the interpretability of score
reports (Hambleton & Slater, 1997; Kannan, Bryant, Zapata-Rivera, & Peters, 2017; Kannan,
Zapata-Rivera, & Leibowitz, in press; McJunkin & Slater, 2017; Rick et al., 2016).

Unfortunately, this step of the process (feedback from users of the report) is sometimes
omitted due to tight reporting schedules, cost limitations, or both. When there is time and
money for this important step, there are a few different ways to gather information from people
like those who will use the score report. Options for gathering feedback from score report users
range from broad-based surveys to focus groups, to one-on-one usability studies or cognitive
labs with end users.

Surveys are generally the most cost-effective way to gather information on what users want
in the score report. Even if the response rate on a survey is not very high, those who do respond
will tend to be those with strong opinions about the topic, and their responses can provide
useful feedback. If the survey includes open-ended questions, the schedule will have to allow time
for coding the responses. Overall, surveys typically do not provide much rich data (apart from
open-ended questions) or offer the opportunity to ask follow-up questions of the user. However,
they are convenient to use and can be an efficient way to collect feedback.

Small focus group sessions enable the use of more open-ended questions. These sessions
provide an opportunity for discussions among the group members, which can be observed.
They let the facilitator and observers see the users interact with the score report and witness
their reactions to various sections of the score report. However, focus group sessions must have
a clearly structured protocol to be effective. They require more time than surveys, as staff will
need to conduct the sessions, consolidate the findings, and prepare the report. The number of
people (and geographical locations from where you can recruit) will also be limited. The biggest
drawback, however, may be the influence of group dynamics on the results. There is always the
possibility of a single focus group member imposing an opinion on the others. As a result, the
members of that focus group may all say they had the same reactions to the report, when in fact
they did not. Or if the vocal focus group member’s personality is objectionable, the rest of the
group may be inclined to express disagreement with that person’s opinion, regardless of their
own underlying opinions.

One way to get many of the benefits of focus groups, but without the group dynamics, is to
hold individual interview sessions with users, one-on-one. This format is the traditional one
used for usability testing (Silver, 2005) and cognitive labs. A usability consultant or facilitator
can introduce tasks and ask questions while observers on the design, program, and/or client
teams view the sessions in person or remotely, with video and audio recordings made to capture
the participants’ facial expressions (to show reactions such as confusion) and the screen or
paper prototype that is being tested. This strategy is more time-intensive than focus groups,
but it provides the opportunity for the participants to think out loud and express their initial
thoughts, interpretations, and confusions. It enables them to provide information such as which
parts of the report they don’t understand, which version of the report they prefer, and what
additional information they would like to see in the report, without being influenced by
others. The usability test results are often written up in a report that summarizes issues and
makes recommendations for design improvements. This format provides actionable data that
the design team can use to improve the score report design. For example, if you see the same
misconception or point of confusion in several sessions, then you know you have observed
something important. Dickey, Rick, Sireci, and Zenisky (2015) provide a review and
bibliography that offers guidance on how to develop protocols for gathering this type of feedback for
score report design.

Whatever strategy is employed for gathering user feedback, it is a good idea to test for
accessibility by trying out the score report on score users with visual disabilities. In addition, it is
important to keep a record of score user feedback in writing. That written feedback could be
in the form of notes taken during a focus group session, a transcription of a recording from a
usability study, or ideally, a more formal report summarizing all score user feedback.


Design Process: Step 6 
Finalize the Design 


At this stage in the process, the graphic designer applies the feedback gathered from Steps 4
and 5 to finalize the score report design. If many of the participants express similar views about
the proposed score report, the necessary changes will be obvious. But if various groups react
differently, someone, usually the client, may have to decide which changes the design team
should implement. It can take multiple rounds of revision to get to the point where the client is
ready to approve the design. This iterative nature of the score report design process is illustrated
as the loop in Figure 7.1.

Once all of the decisions about the score report elements and the accompanying text have
been made, the design process ends, and the final score report mockup can be handed over
to the IT group, which will begin coding the systems that will produce the report. During the
production phase, much work remains to properly translate the mockup into the version of the
report that will be provided to score users. Having IT staff involved throughout the design
process is important, to make sure the proposed designs are feasible from a production standpoint
and to make sure the IT staff understands what the client and score users want from the final
report. The IT staff can also keep everyone involved, including the client, informed about the
time needed for coding the production version of the report. To minimize the risk of missing
reporting deadlines, it is important to give the IT staff the time they need to program the
production version of the report. For operational score reporting to work well, there are many
special circumstances that their coding must anticipate. For example, what message should appear
on the score report if a student misses only the mathematics portion of the test? What message,
if any, should appear if a student received accommodations like extra time or additional help
with language? The design schedule must accommodate the time needed for IT staff to evaluate
possible circumstances and to build solutions to respond when the circumstances arise. The
design team must stay engaged during this step to ensure that the production version of the
score report continues to meet the requirements that were identified during the design process.
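
To make the idea of anticipating special circumstances concrete, here is a minimal sketch of how production code might select messages for a report. It is an illustration under stated assumptions, not a description of any actual reporting system: the `StudentResult` record, its field names, the `report_accommodations` flag, and the message wording are all hypothetical, and the real wording and policy would be decided with the client.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StudentResult:
    """Hypothetical record for one student; field names are illustrative only."""
    name: str
    scores: Dict[str, Optional[int]]  # None means the section was not completed
    accommodations: List[str] = field(default_factory=list)


def special_messages(result: StudentResult, report_accommodations: bool) -> List[str]:
    """Return the special-circumstance messages to print on this student's report."""
    messages = []
    # One message per section that has no score to report.
    for subject, score in result.scores.items():
        if score is None:
            messages.append(
                f"No {subject} score is reported because this section was not completed."
            )
    # Whether accommodations are mentioned at all is a policy choice (placeholder flag).
    if report_accommodations and result.accommodations:
        listed = ", ".join(result.accommodations)
        messages.append(f"This test was taken with accommodations: {listed}.")
    return messages
```

Keeping each circumstance as an explicit rule makes it easier for the design team to review the list of cases with IT staff and to confirm that every message has a place in the layout.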


Interactive Score Reporting 


To this point, the chapter has focused on the design process for a single static (or fixed) score
report. Here the term “static” means that the same information will be presented on each score
report in a noninteractive way. This is the type of paper score report that is mailed to a score
user’s home or may be available online as a PDF file or a link sent via email. Interactive score
reports or score reporting systems, on the other hand, are always presented electronically (via
computer, tablet, or smartphone). An interactive report includes a set of displays with
hyperlinks and interactive menus that enable the user to select the information to be presented and to
move from one type of view to another.

Demand for interactive reports has increased in recent years, just as personal-use, mobile
technology (tablets, smartphones, etc.) has become more prevalent in our daily living. Some score
users want access to score reporting information immediately and some may require the ability
to tailor reporting to answer specific questions. Designing an interactive score report or an entire
score reporting system is much more complicated than designing a single static score report. Score
reporting systems house a collection of interactive reports that allow users to interact with the
score report information. These systems encourage users to explore the data by clicking on aspects
of a report to drill down for more information, sorting the data to show the score information in
a particular order, or changing the way in which scores are displayed (tabular vs. graphical). This
type of functionality requires sufficient time to design each individual display within the reporting
system. You will need to understand and test how users will interact with the information in each
report display and build intuitive ways for users to navigate between the various displays.

Most, but not all, of the principles for designing a static score report apply to each display
included in an interactive score report. The guidelines described above for score report design
should be applied to the design of each display in an electronic score reporting system. In a
sense, one can think of each screen of the reporting system as a static score report, because the
information on each screen is intended to communicate some aspect of the testing results to
the score user. However, designing an interactive report involves the added complexity of
determining how a user will navigate through such a system. When working through design of the
navigation, there are a number of questions that must be considered:


• What will the user want to see first?

• How often and in what ways will the user want to interact with the system?

• How can the user drill down to more specific or detailed information?

• How might the user want to sort the information on the screen?

• Where might the user want to gain access to other resources?

• How can the user easily navigate back to a previous screen or to another part of the
system?


This type of design work is where the user experience staff and cognitive science researchers
on the team have a good deal of input. It is here that usability testing and evaluations of the
comprehension of the information presented are a critical piece of the design process.


Lessons Learned


In addition to following the steps outlined above, there are a number of other lessons we have
learned about the score report design process. In the section below, we share some of the
important lessons we have learned while designing score reports for large-scale
assessments. Some of these constraints and pitfalls have been mentioned above, but they
bear repeating due to their impact on the design process.


What Constraints Are Commonly Faced?


1. Cost. This is usually a constraint for any testing program. More pages, color, and
customization included in a score report usually result in increased cost. Most states still require
paper student score reports to be mailed home to parents. When hundreds of thousands
or even millions of score reports need to be printed and mailed, we are often limited to
the front and back of an 8½ x 11-inch piece of paper and at the most two or three colors
in addition to black.

2. Availability of data. Sometimes the information included in the report is limited by the
availability of the data. For example, a client may want to include comparisons to average
scores for all test takers, but the scores for some test takers may need to be reported before
others have taken the test. In this situation, one possible solution is to use the data from
previous years. This solution works well when there is not much year-to-year change in
the performance of the group of all test takers. However, there often is substantial
year-to-year change, especially on a new or revised test.

3. Tight schedule. Another limitation we often need to work around is the schedule for
design. Sometimes we need to work very quickly to create designs in time to meet client
deadlines, and this time pressure often results in a decision not to gather feedback from
score report users. In reality, there simply may not be enough time to conduct needs
assessment and usability testing with the intended audience for the score report or score
reporting system.

4. Display space. Not enough space is another limitation. It may be necessary to print the
entire report on two sides of an 8½ x 11-inch piece of paper. There are often multiple
subjects and scores to report, with text descriptions needed to explain the scores or
additional information requested by the client. Sometimes score report designers are required
to include so much information that the only solution is to put the text in very small
print, making it unlikely that the score report recipients will read any of it. The font size
in a score report should be no smaller than a 9- or 10-point font for the main areas of the
report, and no smaller than a 7- or 8-point font for minor details or footnotes.
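
The font-size guidance in the last constraint lends itself to a simple automated check on a mockup. The sketch below is only an illustration: the `(label, role, point_size)` layout description is a hypothetical format, and the thresholds use the lower bounds mentioned above (9 points for main areas, 7 points for minor details) as assumed minimums.

```python
# Assumed minimum sizes in points, taken from the lower bounds stated in the text.
MIN_POINT_SIZE = {"main": 9.0, "detail": 7.0}


def undersized(elements):
    """Return the labels of elements whose font size is below the minimum for their role.

    `elements` is an iterable of (label, role, point_size) tuples, where role is
    "main" or "detail". This layout description is a hypothetical format.
    """
    return [label for label, role, size in elements if size < MIN_POINT_SIZE[role]]
```

Running such a check whenever text is added to a mockup gives the design team an early signal that the requested content no longer fits the available space at a readable size.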


What Are Some Pitfalls to Avoid?
Vocal, Inexperienced “Designers”


Because a good score report is easy to read and understand, people may believe that it is easy to
design a score report relatively quickly and that anyone can do it. Sometimes vocal individuals
without experience in score report design (either on the client side or on the program team)
insist on using a certain graphic or explanation that they like, instead of one that will typically
be better understood by the end-user. In cases like this, it is quite helpful to be able to point to
results from market research or a needs assessment with the actual users for the testing program
to support selection of a particular design treatment.


Not Involving the Right Decision Makers Early in the Design Process


This pitfall often results in going back to the drawing board at a late stage in the design process.
These decision makers are often busy people, which makes it difficult to get time with them
to provide input on the score report. However, it is important to make sure they agree with at
least the general designs before too much time is invested, and to get their approval in writing if
possible. We had one situation in which a client rejected a design early on in the design process,
only to come back to us later suggesting that we try a design similar to the one we originally
presented. It turned out that the key decision maker for the client had not been included in the
early decision to reject the initial design. In the interim, the design team produced several more
versions of the report, getting the client’s reactions to each version. In that situation, hundreds
of hours of work and weeks or months can be wasted.


Clients Insisting on Including Too Much Text 


Clients are passionate about their assessments and want to fit as much information as
possible into the limited space available. It seems that when there is white space on a page, some
have an irresistible urge to fill that space with words. This is contradictory to what score report
users may want or need. In fact, in one of our focus groups, parents specifically stated that they
preferred fewer words, bullets rather than long sentences, and more white space (Rick et al.,
2016), yet we are often asked to add more descriptive text wherever space allows. One way to
potentially avoid this problem is to present the client with two versions of the score report, one
that you recommend for better readability and one that includes all of the text that they want to
include. It would be better yet to present both versions to test score users and get feedback from
them about which version is easier for them to read and interpret.


Educational Jargon on Score Reports


Score recipients want to know their score, whether they “passed,” how they compare to their
peers, and what they need to do to improve. They don’t want to be confused by psychometric
terms or other educational jargon. They may not care about the alignment of scores to
standards. They usually don’t know what it means to say that an assessment is “vertically scaled.”
They typically don’t have the concept of a “standard error of measurement” or a “confidence
interval,” and if you tell them that their scores contain “error,” they will want to know why you
couldn’t get it right. If the report includes any of these concepts, it is important to explain them
in layman’s terms, and those explanations can be difficult to create.


Clients Insisting on Including Something in the Report That Is Against
Your Professional Judgment


If your design team disagrees with the client about an aspect of the score report, it is wise to
document the team’s misgivings and recommendations in writing. Supporting your concerns
with citations from the literature may be helpful (if they exist), but clients may be more swayed
by recommendations based on regulations, court decisions, or examples from similar reports
that were favorably viewed by the public. In one case, a client wanted to use a color that did
not pass color contrast standards for visual accessibility. In another instance, a client wanted to
simplify an explanation to the point that it was no longer entirely accurate. Share your concerns
with the client in writing, so the client will be fully aware of the issues involved. If you can
substantiate your concerns with results from research evaluating score report user understanding,
also include those in the letter to the client. Then let the client decide how to proceed. In the
end, the report belongs to the client. The design team should never include misinformation on
a report, but no client has ever asked us to do so. What is more common is for a client to request
that the report include information that the design team does not think is useful. In general, it
is wise to document your recommendations to the client. If a problem occurs because the client
decided not to follow your recommendations, it could be very useful to have a record of those
recommendations. It could be even more useful if your recommendations to the client were
accompanied by an explanation in which you anticipated the problem that actually occurred.


The Role of Score Report Design Research


Research to determine how score report users comprehend and interpret test results could
potentially help testing organizations to produce better score reports. However, much of the
research done so far is based on small samples. Even when sample sizes are adequate, the results
often depend heavily on the population the participants represent. What works well with one
population of score recipients may not work well with another. In practice, score report design
is most often based on visual design and user experience principles, driven by client
preferences, with little research available to guide specific decisions about the design. However, there
has been a noticeable increase in research on the topic of score reporting in recent years, and
the results may help practitioners to improve score report design in the future. The chapters in
this volume describe much of this research.


Conclusion 


In the years that we have been working as a team to design score reports, we have learned a few
things, and we expect to learn more each time we go through the process. In starting out, we
focused our attention almost entirely on the graphics and the text included on the report. While
these are clearly critical, we now realize that documented communication among all involved
and strict adherence to the schedule for the work are just as important. Getting feedback from
score report users is a design step that we are seeing more often, but unfortunately it still is
often overlooked in score report design budgets. We will continue to emphasize to our clients
the importance of obtaining user feedback. Hopefully over time, gathering feedback from score
report users will become more the rule than the exception. In this chapter, we have offered our
best advice about the score report design process, based on what we know now. In the years to
come, we hope to learn how to improve the process further.


Effective Reporting for Formative Assessment
The asTTle Case Example


Gavin T. L. Brown, Timothy M. O’Leary, and John A. C. Hattie


Assessment should have a purpose. As Zumbo (2009) stated, in the context of discussing
validity, “it is rare that anyone measures for the sheer delight” (p. 66), going on to concede that
measurement is “something you do so that you can use the outcomes” (p. 66). Within
educational contexts, there are many ways testing might be expected to be used to improve schooling
(Haertel, 2013), as well as many ways users might anticipate using test results (Hopster-den
Otter, Wools, Eggen, & Veldkamp, 2016). One key use, perhaps the primary use, of educational
assessment is the support of student learning (Popham, 2000). Given such improvement
purposes for tests, validity requires that reports on student performance be well aligned to the test
(and the test well aligned to the intended curricular goals) and well designed to ensure
understanding (Tannenbaum, this volume).

In any system that expects teachers to monitor and respond to student learning, teachers are
important users of test information. In such systems, the teacher’s role is primarily to mediate
test score information into appropriate instructional decisions (e.g., pace of progress, student
grouping, task and activity design, selection of curricular resources, etc.). The focus of this
chapter is on the communication of test results to teachers in ways that foster interpretations
and actions that align with those intended. Shepard (2001, 2006) makes it clear that most
educational assessment is carried out in classrooms by teachers and that significant improvements
are needed in how testing might continue to play a part in that process. Teachers are expected
to make a series of qualitative interpretations about observed student performances, as well as
interpretations of test scores (Kane, 2006). These interpretations occur as teachers interact with
students in the classroom and are not simply recorded for later interpretation. While modern
directions in assessment design focus on ensuring that a robust theory of learning or cognition
is present (Pellegrino, Chudowsky, Glaser, & National Research Council, 2001), it seems more
appropriate in evaluating test reports for teachers to focus on theories of effective
communication and instructional action.

Within educational settings, the first goal of a diagnostic test score report should be to ensure
that the test reports inform teachers' decision-making about “who needs to be taught what next”
(Brown & Hattie, 2012). Extensive research on feedback (Hattie & Timperley, 2007) shows that
in order to close the gap between where students are and intended curriculum goals and
standards, tests have to describe diagnostically the current status (strengths and weaknesses) of a
student and point to action that the teacher and/or the student can take to improve learning
so as to maximise the probability of attaining the success criteria of the lessons. It is the need
to reduce this gap between where students are and where we want them to be that gives
assessment its importance. This means that effective educational tests have to provide more than total score
or rank order information. In order to make instructional decisions about curriculum, reports
need to specify, among other things, how scores can be used (AERA, APA, & NCME, 2014),
though relatively little is contained in the Standards about ensuring that report readers make
appropriate interpretations. Test developers seldom provide validation evidence as to what
report readers see in the reports and what they do with the information (Hambleton & Zenisky,
2013; Hattie, 2014; Hattie & Brown, 2010). Yet, it is these two issues which will determine if test
reports contribute to improved outcomes.

As a consequence, the second goal of such a test report is, or should be, to improve the
quality of teacher instruction and student learning (Popham, 2000). This agenda has been made
increasingly explicit with greater policy and research emphasis on a variety of approaches to
assessment including: formative evaluation (Bloom, Hastings, & Madaus, 1971), school-based
assessment (Torrance, 1986), classroom assessment (Crooks, 1988), performance assessment
(Darling-Hammond, 1994), alternative assessment (Birenbaum, 1996), assessment for learning
(Black & Wiliam, 1998), and assessment for teaching (Griffin, 2014). What these approaches
have in common is that they situate the design, administration, scoring and interpretation of
evaluative processes in the midst of the instructional environment, rather than external to it.
This improvement-oriented process has to take place early enough so as to make a difference to
outcomes (Scriven, 1991) and is methodologically catholic in that it does not privilege or
denigrate tests versus other methods (e.g., performances, portfolios, peer or self-assessment, etc.).
These approaches all focus on generating data and decision-making about learning outcomes,
much in the manner of total quality management (Deming, 1986), by the people closest to and
directly responsible for educational practices and processes (i.e., teachers and school leaders).

Parallel to this is the need for reports on test data to reach the teacher soon after the test has
been administered so that the information is relevant to where the learning was when it was
tested. There can be no doubt that a report that arrives from a central test agency some three
months or so after the test date is unlikely to be valid or effective. As we have argued before:


the potential for that information to actually shape meaningful learning activities is
practically nil – the students have changed class or grade, the teachers have moved on to new
material, the class may have been successfully taught that content, and so on.

(Hattie & Brown, 2008, p. 195)


Prompt feedback to the teacher as to which children have which needs or strengths is a sine
qua non in ensuring that standardised tests serve educational rather than administrative or
policy goals. Indeed, another feature of rapid reporting to teachers is the assurance it gives that they
are the first to read the reports; delayed reports may have been monitored and inspected by
superiors before they arrive, more so in jurisdictions that prioritise testing for school
accountability. Rapid reporting allows teachers early access to both pleasing and disturbing data and the
chance to respond to it before external stakeholders inspect the results (Brown & Hattie, 2012;
Hattie & Brown, 2008). Hence, rapid reporting to teachers connects the test information to their
current teaching context and raises the probability that teachers will actively respond to the data.

An important policy consideration that will support accurate teacher interpretations and
decisions from test reports has to do with consequences or stakes associated with the test.
Good tests can lead to educators discovering some very discomforting news (e.g., the class or
school is well below expectations and averages). In an environment where there are negative
consequences (e.g., league tables), there can be strong incentives to game or cheat the test to
avoid “unfair” consequences. Hence, a low-stakes environment, creating a sense of psychological
safety, is often needed to ensure “bad” news in a test report is read and acted upon (Hattie &
Brown, 2008). Helping teachers embrace the “bad news” of poor scores so that correct diagnosis
of need and prescription of appropriate instruction are maximised is the legitimate goal of test
reports. Hence, effective test reporting depends, in part, on the existence of a non-punitive
professional environment anchored on educators using data to improve curriculum, instruction,
and learning (Lai & Schildkamp, 2016).


Defining Score Reports 


Hambleton and Zenisky (2013), the foremost of contemporary score report theorists, described
score reports as the vehicle “to convey how scores can be understood appropriately in the
context of the assessment and what are the supported actions that can be taken using the results”
(p. 482). Rankin (2016) defined a score report as communicating data, through tables, graphs
and words in order to achieve a purpose, typically helping to turn data into actionable
information, for an intended audience. Thus, score reports are the tangible communication used to
disseminate scores, which are the summarised results or output of some observable phenomena
(test performance), to an intended audience. A score report may be a stand-alone single report
or a series of reports; it may be bespoke or automatically generated; it may be a static online
reporting environment or even a dynamic online reporting system. A score report may be any
combination of the above or much more. More than simply the manner in which the outcomes of
testing are reported, score reports are the thin lens through which the outputs from the complex
process of assessment are communicated to its audience. Indeed, score reports are, arguably, far
more than simply the output of assessments: they are part of the assessment they are reporting
(O'Leary, Hattie, & Griffin, 2017b).

Score reports are then of fundamental importance to the intended outcomes of testing.
More than simply the afterthought to the test development process, score reports are the
integral link or interface in the communication between test developers and test score users.
Effectively, score reports are decision support tools (Dhaliwal & Dicerbo, 2015) and shoulder
the responsibility for supporting accurate user interpretation and use of test scores. As such,
their design should be focussed upon optimising user interpretation and use (Zapata-Rivera &
Katz, 2014). How well a score report does, or does not, communicate its message and
subsequently influence the decisions and actions of its intended audience is then critical,
and, arguably, as important to the notion of validity as the other psychometric properties
traditionally considered when undertaking validation (Hattie & Brown, 2010). Indeed, score
reports are, arguably, far more than simply the output of assessments: they are part of the
assessment itself.

Accepting that score reports are the mechanism through which performance is conveyed
to an intended audience, it is evident that score reports are a form of feedback to those
receiving the reports. Within educational contexts, diagnostic or interim assessments are the vehicle
through which teachers receive feedback about the students in their class to assist in answering
the question “who needs to be taught what next” (Brown & Hattie, 2012). In order to close the
gap between where students are and the intended curriculum goals and standards, tests have to
describe diagnostically the strengths and weaknesses of a student and point to action that the
teacher and/or the student can take to improve learning (Hattie & Timperley, 2007).


The Challenges of Score Reporting


Interpretation and use of scores are of critical importance to validation efforts and any
subsequent claims about validity (American Educational Research Association [AERA],
American Psychological Association [APA], and National Council on Measurement in Education
[NCME], 2014). However, interpretation and use of scores do not transpire purely because
testing occurs. Interpretation and use are the conclusion of the complex process of testing and
occur solely because of audience engagement with, and interpretation of, the output of test score
reports. In fact, how well a score report does, or does not, communicate its message and
subsequently influence the decisions and actions of its intended audience is critical and as important
to the notion of validity as the other psychometric properties traditionally considered when
undertaking validation (Hattie, 2010).

Unfortunately, however, validity theory and validation practice rarely incorporate explicit
references or guidance about how to deal with the actual (as opposed to the intended)
interpretations made by report users, nor the consequential actions of score users' engagement
with score reports. The literature on score report design dates back almost three decades. That
literature persistently identifies that test users have difficulty in understanding test scores as
intended, across a range of report formats (Goodman & Hambleton, 2004; Hambleton & Slater,
1997; Jaeger, 1998; Van der Kleij & Eggen, 2013). The last 25 years have seen significant
contributions to the design of test reports from the information display literature (Bertin, 1983;
Cleveland, 1994; Few, 2012; Kosslyn, 2006; Tufte, 1990, 2001; Wainer, 1997). For example, Tufte
(2001) identified seven principles of graph design which are pertinent to any effort to represent
test scores graphically:


1. Show the data

2. Direct the reader to think about data being presented rather than some other aspect of
the graph

3. Avoid distorting the data

4. Present data using the minimum of ink

5. Make large data sets coherent

6. Encourage the reader to compare different pieces of data

7. Reveal the underlying message of the data.


As a consequence of significant work (Aschbacher & Herman, 1991; Hambleton & Slater,
1997; Hambleton & Zenisky, 2013; Hattie, 2010; Impara, Divine, Bruce, Liverman, & Gay,
1991; Jaeger, 1998; Linn & Dunbar, 1992; Rankin, 2016; Zapata-Rivera & Van Winkle, 2010;
Zenisky & Hambleton, 2012, 2015), there has been an evolution of guidelines relating to score
reporting. These guidelines have been integrated with explicit notions of user validity (MacIver,
Anderson, Costa, & Evers, 2014) and of score report interpretability as an aspect of validity (Van
der Kleij, Eggen, & Engelen, 2014). The ongoing advancement of score reporting guidelines
has seen a progression from recommendations about what and how to produce score reports
through to iterative design methodology (Hambleton & Zenisky, 2013; Zapata-Rivera & Van
Winkle, 2010).

Hattie (2010) enunciated 15 principles for the design of test reports which align in part with
Tufte and also extend to address issues arising when test reports are embedded within software
systems. For example, he recommends in accordance with Tufte that reports (Principle 6)
minimise the amount of “numbers” and maximise the amount of interpretations, (Principle 8) have a
major theme, (Principle 10) minimise scrolling, be uncluttered, and maximise the “seen” over the
“read.” In terms of deploying test reports within a software system, he recommends (Principle 3)
that readers of reports need a guarantee of safe passage from where they are in the system to where
they want to go and (Principle 4) report readers need a guarantee of destination recovery: that is,
the system must intuitively allow them to navigate among the various reports and tools within the
human-computer interface. He also recommends (Principle 7) that reports be restricted in the
amount of information displayed (i.e., the answer is never more than 7 plus or minus 2).

Current best practice is captured in the Hambleton and Zenisky model (2013) and
comprehensively described by Zenisky and Hambleton (2015) in the Handbook of Test Development
(Lane, Raymond, & Haladyna, 2015). This model is an iterative process of score report
development and refinement. The process is conceptualised as a four step or phase model. The first
phase is about laying an appropriate groundwork. The second phase is about report
development. The third phase is about field test and redesign. Finally, the fourth is about evaluation and
maintenance. One of the key aspects that makes this model best practice is that it is focused on
an ongoing process of improvement and refinement and not simply static guidelines. Consistent
with the Hambleton and Zenisky model, Hattie (2010) recommended that (Principle 1) the
validity of test reports be determined by the readers' correct and appropriate inferences and/or
actions in response to the report, (Principle 2) evidence be obtained to demonstrate how
readers interpret reports, (Principle 5) the focus be on maximising interpretations not displaying
numbers, (Principle 11) reports be designed to address specific questions, (Principle 12) reports
provide justifications that the test is fit for the specific applied purpose, and (Principle 15) reports
be thought of as actions to take, not just screens to print or store.

With an eye towards the Hattie (2010) principles, O'Leary (2017) has proposed amendments
to the Hambleton and Zenisky (2013) model, aimed at providing more explicit articulation to the
evaluation phase of their model. The goal of those recommendations is to direct the collection
of evidence concerning user comprehension of score reports. Two overarching design principles
of evaluation (i.e., utility and clarity) with seven sub-domains have been promulgated (Table 8.1;
O'Leary, Hattie, & Griffin, 2016b). Utility requires that score reports are designed with a clear
purpose, actions, and outcomes in mind, while clarity expects score reports to be designed so
that they are easily comprehensible to the target audience. These align with the proposed forms
of validity evidence put forth by O'Leary, Hattie, & Griffin (2016a, 2017a). The purpose of these
principles is to provide an outcomes focused lens through which score reports are considered.
A rubric for evaluating the alignment of score report construction against these criteria has been
developed (O'Leary, Hattie, & Griffin, 2016b) and subsequent empirical work has demonstrated
that the rubric is a reliable tool for obtaining evidence that “better” designed score reports more
effectively communicate their intended message (O'Leary, Hattie, & Griffin, 2017a).


Table 8.1 Empirically derived design principles for outcomes focused evaluation of score reporting.

Utility
  Purpose                The purpose of a score report must be explicit.
  Interpretation         The intended interpretations of scores must be explicit.
  Actions                The intended consequences or actions of interpretation must be explicit.

Clarity
  Design Features        The design of score reports must be based upon current best practices inclusive of
                         contemporary examples of best practices and guidelines and recommendations from
                         within the literature.
  Interpretive Guidance  Score reports must be designed to be stand-alone aiming to minimise additional work
                         or tasks that are required to fully interpret the reported information.
  Displays               Score reports must integrate multiple forms of data representation.
  Language               The language used in a score report must be easily understood by the intended
                         audience.


These standards can be used to evaluate any test-based reports as attempts to
communicate expert information to a lay end-user audience. Unsurprisingly then, the role of timing is
not explicit. A separate argument about the importance of rapid or delayed reporting needs
to be made, though it could be subsumed under the notion of validity. As we have already
indicated, test reports which are not available to teachers soon after test administration
cannot guide instruction. Further, in light of quality management principles, providing additional
insights about directions for improvement needs to be timely and for classroom teachers timely
is next Monday or tomorrow, not three months from now. Hence, well-designed test reports
that arrive too late to make a difference are of little value to formative practice. The advantage of
pure “in-the-head” and “in-the-moment” formative interactions between teachers and students
(Swaffield, 2011) is that they happen immediately, while the need or opportunity is evident. Such
interactions may be more error prone than tests, but they are immediate. Matching this timely
facet matters to teachers and teaching; delayed reports are fundamentally purposeless.


The Audacious Example of asTTle


To illustrate the principles enunciated in this chapter, it is constructive to examine the
development of New Zealand's Assessment Tools for Teaching and Learning (asTTle) test system
(Hattie & Brown, 2008; Hattie, Brown, & Keegan, 2003; Hattie, Brown, Keegan, et al., 2004).
The asTTle test system is an online standardised test system for reading comprehension,
mathematics and writing (and the Māori language equivalents) used in New Zealand
primary/elementary and secondary/high schools. The test materials have been calibrated to Levels 2 to 6 of
the New Zealand national curriculum (Ministry of Education, 2007) and norms are available
for students in grades 4 through to 12 (nominally ages 8 to 17). The asTTle system consists of:


1. an item bank of over 20,000 curriculum-objective and level calibrated and
difficulty-calibrated multiple-choice and open-response tasks,

2. a teacher-controlled test design engine,

3. an automated test scoring engine that converted 1PL Rasch item scores to performance
on achievement objectives and Curriculum Levels,

4. a reporting engine that permitted selection from a range of test reports concerning group
and/or individual performance, and

5. an online catalogue of teaching resources indexed to the test reporting system.


This system was created in a policy environment that prioritised diagnostic testing for the
explicit purpose of informing improved instruction and student learning outcomes (Ministry
of Education, 1994). Indeed, the official policy and the rhetoric used around the research and
development phases of asTTle made explicit that using the test system was a low-stakes activity:
use was not required nor was reporting to government and there was no centrally determined
test administration (Hattie & Brown, 2008). Furthermore, as was made clear by the Ministry
of Education (2010), the system was designed to inform and support teachers by giving them
access to externally-referenced norms and diagnostic curriculum-aligned reports, rather than a
mechanism to be used by the Education Review Office or the media to judge or evaluate
teachers and/or schools. This ensured that generation of data about student learning was done in a
non-punitive manner: the goal was to inform improvement, while generating data that allowed
teachers to understand how their own students compared to similar students drawn from a
robust national norming (Brown & Hattie, 2012).

New Zealand primary school teachers tend to make extensive use of standardised diagnostic
testing, especially at the beginning of the school year to inform within-class grouping (Crooks,
2010). It is important to note that none of the standardised tests available through the New
Zealand assessment “tool box” were compulsory or nationally administered as a national test
(Brown, Irving, & Keegan, 2014); the use of all tests was completely voluntary with data retained
at the school level. However, the standardised tests available before asTTle were general ability
tests and reported only total score and rank order performance information. These limitations
were overcome in asTTle because the system (a) allowed testing at any time, (b) allowed
teachers to customise tests to classroom teaching, (c) used calibration that allowed different tests to be
compared over time and over classes, and (d) reported performance on curriculum achievement
objectives and levels, as well as normative performance.

Hence, the overall goal in designing the asTTle test report system was to give teachers a
sufficiently accurate portrayal of student strengths and weaknesses so that teachers could make
appropriate decisions about who needs to be taught what next. This meant that the level of
accuracy required in reporting a score was determined by whether the teacher would make a
defensible decision about curriculum materials, pedagogical activities, or student grouping. In
a sense, a principle similar to Goldilocks was used in that there were really only three options:
curriculum content and material too easy, just right, or too hard. Since teachers already have a
reasonably accurate sense of rank order within a class of students and have already made
judgements about the curriculum level which they are teaching, a good test report system would have
to go beyond this extant information. A good system would have to tell teachers something that
they did not already know; teachers should be surprised rather than comforted.


Alpha Testing 


To achieve this, a series of teacher interviews and focus groups were conducted early in the
system development process to determine the administrative and educational goals that teachers
and school leaders had for an assessment event (Meagher-Lundberg, 2000; note that all asTTle
Technical Reports are marked as such in the reference list). That research identified that information
comparing their own students to national norms was desired in order to report to a range of
stakeholders (e.g., parents, trustees, staff), to inform school and staff self-appraisal and
professional development, to plan teaching, and to target resource commitments. Additionally,
teachers wanted descriptive information relative to curriculum levels and achievement
objectives. With this information, the design of test reports was initiated. This involved
collaborating with a graphic artist who created simulated screen shots for a set of report templates that
might achieve the various goals identified by the teachers and which were deemed to be feasible
through an objectively-scored test. These report templates were iteratively presented to teacher
focus groups (Meagher-Lundberg, 2001a, 2001b) to ascertain that teachers could make the
intended interpretations. Initial reaction to the designs indicated a strong need for clarity as to
how navigation between reports would be conducted. Teachers indicated initial designs lacked
clarity as to the meaning of various communicative devices such as coloured fields depicting
normative information, arrows, dials, numeric scales, labels, and the position and salience of
explanatory terms (Meagher-Lundberg, 2001a).

In light of this feedback, further revisions were created taking advantage of graphical
communication insights obtained from the research literature (Brown, 2001). The navigation
problem was successfully addressed subsequently by placing the report images in a browser
window (Meagher-Lundberg, 2001b), taking advantage of existing end-user preferences and
knowledge about how software operated (Spolsky, 2001). Changes to the devices used to
communicate information were generally successful according to the second focus group. This led
to the design of a report engine that included a menu system to navigate to one of the
following report templates: (a) a group or cohort achievement comparison console; (b) individual
and group “kid maps”; and (c) a curriculum level achievement “skyline” showing proportions of
group performing at each level. Additional features for reporting cognitive processing against
the SOLO taxonomy (Biggs & Collis, 1982) and attitudes towards tested subjects were
identified for integration into either individual or group reports. Note that the sample reports shown
below (Figures 8.1 to 8.4) are from the current e-asTTle version (e-asTTle Project Team, 2009).
As can be seen from the following example reports, each report provided interpretive guidance
on-screen and addressed a single, clear educational purpose, with the goal of supporting actual
teacher decision-making.

Note that the achievement cohort comparison console (Figure 8.1) drew heavily in its design
on the previously deployed CRESST Quality School Portfolio report (Baker, 1999). That report
used a series of gauges and dials to capture various quality aspects of schools (e.g., safety,
technology, attendance, standardised test performance, etc.) and made use of traffic light colours to
indicate level of concern (i.e., red = below average; yellow = average; green = above average). This
report was intended more for the cohort, subject, or school leader who needed an overview of
performance relative to national normative performance. In e-asTTle, a key is used to remind
the reader that the normative performance of the related comparison group is the edge between
the blue field and the white space and the performance of the tested group is shown as red
pointers or box plots. Any aspects of the curriculum not covered by the test are greyed out to
focus interpretation on the aspects for which there was sufficient information on which to base
decisions and actions. Because it may be unfair to compare “my school” with the whole nation,
especially if my school's population is drawn from either the tails or tops of the socio-economic
distribution, users are able to specify the type of comparative norm by selecting either student
(e.g., sex, ethnicity) or school information (i.e., school cluster). The point of this selection is not
only to drill down into performance of students meriting specific attention but also to remove
the obstructive claim that my students cannot achieve because they are disadvantaged: if the
average for similar students or school types is higher than one's own, then it does not hold that
such factors in and of themselves prevent improvement.

Likewise, the “kidmap” reports (Figure 8.2) draw on the work of Wright and Stone (1979) in
which performance is classified into one of four spatial fields or categories: that is, (a) correct
and easy, (b) correct but hard, (c) incorrect but easy, and (d) incorrect and hard. This is achieved
through comparison of student accuracy on the item (i.e., correct vs. incorrect) according to
the difficulty of the item relative to the student's overall performance. Rather than listing items
in each space, the asTTle system reports achievement objectives in each field, supplemented
by item numbers in order to maximise attention on the teaching of learning outcomes rather
than test items. Clearly, this report was designed for the classroom teacher or counsellor who
needed to discuss with a parent or guardian the specifics of an individual child. The report uses
the same conventions as the Console report to indicate overall performance relative to the same
grade level norm both in terms of subject performance and motivation or attitudes. This
permits partnership discussions between teacher and parent with the student to identify priorities
for both work at home, as well as work in class.
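
The four-way classification described above can be made concrete with a short sketch. The Python below is illustrative only and is not the asTTle implementation: the `Response` record, the `kidmap_field` and `kidmap` functions, and the rule that an item counts as “hard” when its difficulty exceeds the student's ability estimate (both assumed to sit on one common scale) are assumptions introduced here for the example.

```python
from dataclasses import dataclass

# The four kidmap fields named in the text.
FIELDS = (
    "correct and easy",
    "correct but hard",
    "incorrect but easy",
    "incorrect and hard",
)


@dataclass
class Response:
    objective: str     # achievement objective the item measures (hypothetical label)
    item: int          # item number on the test
    difficulty: float  # item difficulty, assumed to be on the same scale as ability
    correct: bool      # whether the student answered the item correctly


def kidmap_field(response: Response, ability: float) -> str:
    """Classify one response by accuracy and by difficulty relative to ability."""
    # Assumption: an item is "hard" only if it is strictly above the ability estimate.
    hard = response.difficulty > ability
    if response.correct:
        return "correct but hard" if hard else "correct and easy"
    return "incorrect and hard" if hard else "incorrect but easy"


def kidmap(responses: list[Response], ability: float) -> dict[str, dict[str, list[int]]]:
    """Group achievement objectives (with their item numbers) under each field."""
    fields: dict[str, dict[str, list[int]]] = {name: {} for name in FIELDS}
    for r in responses:
        fields[kidmap_field(r, ability)].setdefault(r.objective, []).append(r.item)
    return fields
```

Keying each field by objective, with item numbers attached as supporting detail, mirrors the report's emphasis on learning outcomes rather than on individual test items.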

To cater for the reality that teacher planning has to address groups of students (e.g., classes,
grade cohorts, or special categories), the individual kidmap learning pathways report was
transposed using the same colour coding to point teachers to the proportion of students having
strengths, mastery, weaknesses, or gaps according to curriculum objectives (Figure 8.3). To enable
priority-making decisions, teachers had only to look for objectives for which the blue space (i.e.,
to be achieved) was large and those in which the green space (i.e., achieved) was large. The
former indicated content that a high proportion of students needed to be taught, while the latter
indicated material on which a high proportion needed no further instruction or practice.
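
The transposition from individual to group view amounts to counting, for each objective, the share of students falling in each category. The sketch below illustrates that arithmetic under stated assumptions: the function names, the category labels used as plain strings, and the 0.5 cut-off for calling a coloured space “large” are all hypothetical and are not taken from the asTTle system.

```python
from collections import Counter, defaultdict


def group_proportions(student_maps: dict[str, dict[str, str]]) -> dict[str, dict[str, float]]:
    """Turn {student: {objective: category}} into {objective: {category: proportion}}."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for categories in student_maps.values():
        for objective, category in categories.items():
            counts[objective][category] += 1
    return {
        objective: {cat: n / sum(c.values()) for cat, n in c.items()}
        for objective, c in counts.items()
    }


def objectives_where_large(
    proportions: dict[str, dict[str, float]],
    category: str,
    threshold: float = 0.5,  # hypothetical cut-off for a "large" coloured space
) -> list[str]:
    """List objectives where the given category covers at least `threshold` of students."""
    return sorted(
        objective
        for objective, shares in proportions.items()
        if shares.get(category, 0.0) >= threshold
    )
```

Calling `objectives_where_large(p, "to be achieved")` would list the content most of the group still needs to be taught, while `objectives_where_large(p, "achieved")` would list material needing no further instruction.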

Unsurprisingly, teaching to the mean will disguise the distribution of performance. Hence,
the system provides a distribution of performance report (i.e., Curriculum Levels Report;
Figure 8.4), which reveals both central tendency and distribution. Because New Zealand primary
school teachers practice considerable within-class ability grouping, each “skyline,” when selected,
displays the names of students in each performance group. This allows teachers to move
children into different grouping combinations according to identified needs, rather than create
persistent groups across all learning areas. In fact, this ability to differentiate for grouping was
noted by early adopting teachers and their students as a positive facet of the system (Archer &
Brown, 2013).
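
A small sketch shows why a distribution view adds information that a mean cannot. It is an illustration under assumptions, not the e-asTTle report logic: the function names are hypothetical, and curriculum levels are represented as plain integers for simplicity.

```python
from collections import defaultdict
from statistics import mean


def skyline(levels: dict[str, int]) -> dict[int, list[str]]:
    """Group student names by curriculum level, ordered from lowest to highest level."""
    groups: dict[int, list[str]] = defaultdict(list)
    for name, level in levels.items():
        groups[level].append(name)
    return {level: sorted(groups[level]) for level in sorted(groups)}


def level_proportions(groups: dict[int, list[str]]) -> dict[int, float]:
    """Proportion of the group performing at each level (the height of each bar)."""
    total = sum(len(names) for names in groups.values())
    return {level: len(names) / total for level, names in groups.items()}


# Two hypothetical classes with the same mean level but different spreads.
class_a = {"Ana": 3, "Ben": 3, "Cal": 3, "Dee": 3}
class_b = {"Eli": 2, "Fay": 2, "Gus": 4, "Hana": 4}
same_mean = mean(class_a.values()) == mean(class_b.values())
```

Here `same_mean` is true, yet `skyline(class_b)` separates two distinct groups that a single average would hide, and the names attached to each level are what make regrouping practical.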

We suggest that the suite of reports and the ability to customise those for the multiple
purposes of classroom teachers and school leaders meant that the system complies with the
expectations of good reporting outlined in Table 8.1 and earlier.


Beta Testing 


Having established through “alpha” testing reasonably robust communicative test reports, these
designs were further refined through “beta” feedback from (a) Ministry of Education officials
who were the funders and sponsors of the asTTle system, (b) the software engineering team
who advised on feasibility and cost of various design options, (c) pilot testing by teachers who
were exposed to a mock-up of the system, and (d) acceptance testing of asTTle version 1,
containing materials for reading and writing only, which was deployed to 110 primary schools. As
each stage of beta testing was conducted, formative changes were made to the asTTle system to
achieve the curricular goal of helping teachers know what to teach to which students.

The evaluation of the pilot implementation of asTTle (v1) into 110 New Zealand schools
(Ward, Hattie, & Brown, 2003) used a survey to ascertain, among other things, the ability of
teachers to accurately interpret asTTle reports. The survey included a set of report reading
comprehension items, partially inspired by Hambleton and Slater (1997) and Linn and Dunbar (1992).
Results indicated that in general, the Console Reports and the What Next reports had reasonably
high levels of correct interpretation, whereas the means were much lower for Individual
Learning Pathways and Curriculum Levels reports (Hattie, Brown, Ward, Irving, & Keegan,
2006). These results were incorporated into a structural equation model as dependent variables.
The model proposed that attitudes towards computers, ICT, assessment and professional
development would predict the level of involvement teachers had with the asTTle test system, which,
in turn, would predict the teacher evaluation of asTTle and their ability to answer the report
reading comprehension questions. Indicating a belief that assessment is powerful for improving
teaching, rather than for evaluating schools, and seeing the asTTle software as positive were
clear predictors of accuracy in report interpretation (Hattie et al., 2006). The major messages
were that professional development needed to be oriented most towards encouraging a positive
attitude towards using ICT-based assessment as part of teaching and learning. This information
was used to improve the quality and quantity of professional development resources supplied
to asTTle users by the Ministry of Education. It was also used to indicate what should be in the
professional development: clearly, teachers needed assistance in accurately understanding and,
thus, using asTTle correctly.

Based on these studies, the Ministry of Education funded for several years multiple mechanisms
to support teacher learning in making use of the asTTle system. A free-phone
technology-oriented help desk was deployed so that callers using asTTle and e-asTTle on their local
work-stations, school-based servers, and eventually the internet could have prompt help. When
installed, the asTTle system provided user manuals and technical reports to support understanding
and use of the system. These documents were also available online from the Ministry
repository of reports and documents (http://e-asttle.tki.org.nz/). Throughout the nation,
assessment-focused teacher professional development teams (Assess to Learn: AtoL) were
commissioned and funded to provide within-school services focused on the logic of using the asTTle
reports to improve and guide instruction and reporting. The asTTle Project development team
provided initial briefing to AtoL teams but was explicitly excluded from the delivery of
school-based training. Nonetheless, it was apparent that the effectiveness of AtoL teams depended, in
part, on the existing conceptions teachers had of the purpose of assessment: the more they
considered assessment was for accountability, the less use they made of asTTle for improvement
(Brown & Harris, 2009).


Extension to Secondary/High School


The original requirements brief for the asTTle system was focused solely on primary/elementary
schooling; that is, Curriculum Levels 2 to 4, with norms for students in Years 5-8 only. However,
given the success of asTTle v2 in primary schools, the Ministry of Education received vigorous
requests from secondary/high school teachers and their union, the Post-Primary Teachers
Association, for extension of the system to include their students (Brown, 2013). The logic was
reasonably simple: although the curriculum framework expects that Level 4 will be completed
by the end of primary schooling (Year 8), empirical realities are such that many students arrive
at high school still functioning at Levels 2 to 4 (Satherley, 2006). An environmental constraint in
secondary schooling, not present in primary schooling, is the important role secondary schools
play in preparing students for and administering formal qualifications assessments (i.e., the
National Certificate of Educational Achievement, NCEA) (Crooks, 2010).

The NCEA begins with Level 1 in Year 11, culminating in Year 13 with Level 3. Nominally,
NCEA Level 1 is equated to Curriculum Level 6, though some achievement objectives for Level
6 are taught in Year 12 rather than Year 11. The NCEA system evaluates student learning using a
criterion-referenced, standards-based grade system (i.e., Not Achieved, Achieved, Merit,
Excellence), somewhat akin to more conventional letter grade systems (i.e., D/F, C, B, A). NCEA
also structures the curriculum objectives around units of work known as standards; this means
that alignment of test items to NCEA standards might be of value to secondary teachers. This
high-stakes evaluation system predominates in educational assessment in New Zealand secondary
school systems, and so the possibility that the asTTle reports could be modified to accommodate
this alternative system to the curriculum levels framework was explored. Additionally, within
the framework of beta testing asTTle v3 in 55 secondary schools, an evaluation of the accuracy
and sufficiency of the test reports was conducted through a mixture of surveys, telephone
interviews, and focus groups (Hattie, Brown, Irving, et al., 2004).

As reported in Hattie et al. (2006), secondary school teachers were positive about the asTTle
reports, expressing satisfaction with the amount of detail on reports and the relevance of the
reports to their needs. Teachers reported significant help from the formative and diagnostic
reporting functions at both aggregated and disaggregated levels of reporting, especially in the
Group Learning Pathways Report and the Individual Learning Pathways Report. In addition,
they found benefit from the aggregated data in the Tabular, Curriculum Levels, and Console
reports. Several enhancements based on feedback on asTTle v2 were evaluated positively. For
example, instead of just reporting group means on the console report with an ellipse (i.e.,
M +/- se), a box-and-whisker plot showed the distribution of scores for the group being reported.
The display of the national norm score as a coloured field within the dials instead of as a number
below was also seen as an enhancement.

Secondary teachers indicated value in two new types of report. Focus group participants
indicated value in longitudinal reports that showed how individuals or cohorts had been progressing
over time. While the asTTle system had already included the ability to compare scores to similar
students (i.e., schools like mine; Hattie, 2002), the ability to compare performance to different
rather than similar categories (e.g., higher performing clusters or ethnicities) was seen as valuable.

Nonetheless, secondary teachers indicated significant concern about the correct or accurate
interpretation of the asTTle reports. These concerns were obtained both from the Ministry
Telephone Helpdesk and directly from the evaluation study. Confidence that reports
were being understood and acted upon appropriately mattered to the teachers and needed to be
addressed through modifications to the Ministry professional development support services
and asTTle documentation. Although most of the information sought by asTTle V3 users about
report interpretation was available through the PDF manuals included with the asTTle V3
software, it was decided to develop an online tutorial system on understanding asTTle reports that
could be used by individuals or schools as a supplement or alternative to professional
development (Hattie, Brown, Irving, MacKay, & Campbell, 2005). Unlike later online tutorials that used
video (Zapata-Rivera, Zwick, & Vezzu, 2016), these tutorials were slide presentations with
voice-over scripted dialogue that could be controlled by the user.

Lack of alignment to the NCEA system beginning in Year 11 meant that most secondary
teachers had implemented asTTle V3 with students only in Years 9 and 10. However, when
shown the possible reports and tests that asTTle might be able to generate for them as indicators
of NCEA performance, teachers were quite enthusiastic. Teachers indicated that it was important
or very important to know how the curriculum-level indexed items in an asTTle test related
to the NCEA system. Of special interest to the participants would be the ability to create a test
aligned to the various standards of the NCEA system, rather than to the achievement objectives
of the curriculum. Despite the strong endorsement of the sampled teachers for adjustments of
the asTTle reports to align with the official qualifications framework, the Ministry of Education
sponsors declined to fund such research or developments. Perhaps, because the NCEA system is
administered by a separate quasi-autonomous body (i.e., New Zealand Qualifications Authority,
NZQA), such a development funded by the Ministry may have been seen as a breach of
NZQA’s autonomy and responsibility.

Aside from any systemic “turf” issues, this last point raises some interesting challenges around
alignment of formative and summative purposes. It may be that the refusal to adapt asTTle to align
with the high-stakes qualifications system arose because this could constitute a threat to the
intention that asTTle serve goals related to diagnostic formative improvement of teaching and
learning (Brown, 2004). When teachers perceive that assessments are for accountability purposes,
our own studies have found that it is a rare teacher who can balance the tension between
improvement of my teaching and evaluation of school quality (Brown & Harris, 2009). Indeed, this
tension between the purposes or goals of assessment for improvement and assessment for
accountability seems to remain more or less unresolved (Barnes, Fives, & Dacey, 2015; Bonner, 2016).
As long as test systems are used by interested stakeholders to evaluate the work of teachers and
school leaders, we can expect that there will be greater attention and effort paid to raising scores
than improving instruction (Nichols & Harris, 2016). Hence, while the technical capacity exists to
align formative and summative systems and information into a single test reporting framework, the
real obstacles lie in political factors that may subvert well-meaning integration. Unless policy
makers are willing to partner with the teachers and respect their legitimate concerns by attaching
low stakes to tests, it seems highly implausible that improvement and accountability can be
effective bed-partners. Indeed, as long as high-stakes testing or examination or school accountability
testing dominate the educational landscape, policies to support or require formative assessment
are unlikely to be seen as “the real thing” (Kennedy, Chan, & Fok, 2011).


Conclusion


This chapter has outlined the major challenges that face developing and validating test reports
for teachers. The field has developed a reasonably robust understanding of why this has to
be done and how it can be done. However, few test systems have conducted such time- and
resource-consuming programmes of formative evaluation and documented them as has the
New Zealand asTTle system. This system is an exemplar of how accuracy in interpretation of
reports and subsequent actions can be established. Clearly, the field needs more such studies
that establish the validity of tests for use by communities of educators, each of which shares
different standards and approaches to assessment, ICT, and schooling in general.


Applying Learning Analytics to Support Instruction


Mingyu Feng, Andrew Krumm, and Shuchi Grover


This chapter highlights the ways in which learning analytics can be used to better understand
and improve learning environments, instruction, and assessment (Siemens & Long, 2011). As
a set of approaches for engaging in educational research, learning analytics and educational
data mining represent relatively new modes of inquiry. The growth of these approaches maps
closely to the availability of new forms of data being collected and stored in digital learning
environments, administrative data systems, as well as sensors and recording devices. Moreover,
the growth of these fields maps closely onto what the National Science Foundation refers to
as “data-intensive research,” which encompasses more than learning analytics and educational
data mining to include a broad range of social and physical sciences. As new forms of data
have emerged (i.e., transaction level data from digital learning environments as well as digital
forms of audio, video, and text) and been collected at ever increasing scales, there has been an
explosion of efforts to make use of these data for the purposes of research. By and large, most
early work beginning in the mid-2000s was directed at exploring research questions that were
tractable within highly structured, well-designed digital learning environments like intelligent
tutoring systems (ITS; e.g., Koedinger, Anderson, Hadley, & Mark, 1997; VanLehn et al., 2005).
The tight alignment between the learning tasks students were expected to engage in and the data
that were collected in these environments made them ideal for exploring not just the outcomes
of learning but the various ways in which students engaged in learning activities. A basic insight
from these early researchers continues to fuel research and efforts to improve instruction: data
on students’ learning processes is as useful and sometimes more so than data on students’
learning outcomes.

In this chapter, we expand upon this insight and highlight the ways in which data from digital
learning environments, administrative data systems, and sensors as well as recording devices
can be used to support instruction in real classrooms by reporting on students’ learning
activities through various data products (e.g., dashboards). We do so across four cases that represent
varying degrees of proximity to instruction. By highlighting these varying degrees of proximity,
we intend to demonstrate the multiple ways in which learning analytics can be used to support
instruction. Cases 1 and 2 describe efforts to use learning analytics to support instruction
through partnerships that bring researchers closer to practice and practitioners closer to the
work of analytics. Case 3 describes how process data from digital learning environments can
be used to develop better assessments of learning that can be used to organize better learning
opportunities for students. Case 4 describes how providing practitioners with access to carefully
designed data products and dashboards can help them in making more timely and targeted
decisions. The data produced in each of these cases are shared with stakeholders in different
ways including online report systems or dashboards (Corrin, this volume).


Overview of Cases


Case 1 with Summit Public Schools and Case 2 with the Carnegie Math Pathways are based in an
approach to using analytics that Krumm and colleagues refer to as collaborative data-intensive
improvement (CDI; Krumm, Means, & Bienkowski, 2018). Collaborative data-intensive improvement
is an approach that combines tools and routines from improvement science, data-driven
decision-making, as well as learning analytics and educational data mining. The overarching goal
of this approach is to provide a structured way for researchers and practitioners to work together
around identifying a question to pursue, analyzing complex datasets, developing change ideas,
and testing change ideas in local learning environments. Across multiple partnerships, Krumm
and colleagues (2018) identified a series of phases and supporting conditions for using data from
digital learning environments and administrative data systems to improve learning environments
and support instruction. Phase I of a collaborative data-intensive improvement project involves
setting up a partnership, which includes identifying participants and jointly defining the aim of
the partnership (Bryk, Gomez, Grunow, & LeMahieu, 2015). Phase II entails developing
a practical theory for how the partnership will reach its aim (Bennett & Provost, 2015; Yeager,
Bryk, Muhich, Hausman, & Morales, 2013). Phase III centers on data wrangling, exploration,
and modeling (Wickham & Grolemund, 2017). Phase IV builds on insights from data-intensive
analyses in the form of co-developed change ideas, and lastly, Phase V is where members of a
partnership iteratively refine change ideas in real classrooms over time. Cases 1 and 2 describe the
partnerships from which many of these phases were identified (Krumm, 2017).

Case 3 is situated in the context of introductory programming and computational thinking
(CT), a new skill that is seeing rapid adoption at all levels of school curricula as part of
nationwide efforts to support “Computer Science for All” (The White House, 2016). There is a growing
need to measure students’ learning of computational thinking in the context of the complex
problem-solving processes inherent in programming, and also to support all learners through this
process of learning computational problem solving. Given that there are few examples of using
learning analytics to measure students’ learning in open-ended programming environments that
are popularly used in K-12 classrooms, Grover and colleagues push into the emerging realm of
computational psychometrics (von Davier, 2017) for detection of student behavior for formative
assessment (Black & Wiliam, 2009; Heritage & Popham, 2013). They explored how principled,
top-down approaches of measuring complex skills can be combined with bottom-up,
data-driven learning analytics approaches for better interpretation (of data logs from such
programming environments), and consequently better measurement of computational thinking
practices and programming processes (Zapata-Rivera, Liu, Chen, Hao, & von Davier, 2016).
Based on learnings from analyzing data logs from ~300 students using the Alice programming
environment, they developed a framework (Grover et al., 2017) that formalizes a process where
a hypothesis-driven approach informed by Evidence-Centered Design effectively complements
data-driven learning analytics in interpreting students’ programming process and assessing
computational thinking in block-based programming environments. The framework is shared here,
as well as a brief description of the application of the framework on an ongoing research project.

Case 4 is based in recent instructional reforms that address the importance of administrators
and teachers making use of student assessment data to inform decisions about curriculum and
instruction (Means, Padilla, DeBarger, & Bakia, 2009) and thus making instructional practices
more effective (Mandinach & Gummer, 2016). The advances in technology and its popularity
in schools have made it easier to collect student performance data. Learning analysts build
dashboards and a variety of reports to incorporate information from such data, together with
other possible sources of data, and present them to teachers. Although there have been many
different types of dashboards built, few studies have shown evidence that teachers make use of
information presented on the dashboards and adjust instruction accordingly. Case 4 describes
an online homework support tool that was implemented in 44 schools for two years during a
large-scale efficacy trial. Data collected during the study suggested that teachers implementing
the intervention made substantial shifts in their approach to homework review and
instructional practice more broadly.


Case 1: Data-Intensive Research-Practice Partnership 


The partnership with Summit Public Schools (Summit) began in the fall of 2014 with the goal
of developing a research-practice partnership around data collected and stored in multiple
online learning systems used throughout the charter management organization. The partnership
included researchers and practitioners from multiple organizational levels at Summit. The
partnership built on the ideas of (a) learning directly from practitioners about the problems
they experience in their day-to-day work; (b) jointly analyzing and interpreting data generated
by students in digital learning environments to solve practitioner-identified problems; and (c)
co-developing ideas for changes informed by multiple data-intensive analyses.

Around the same time as the fields of learning analytics and educational data mining
were coalescing, new partnership models for engaging in educational research were emerging
under the banner of research-practice partnerships (e.g., Coburn, Penuel, & Geil, 2013).
Newer forms of data combined with newly developing models of research served as the primary
building blocks for the partnership. The research goals for the project included (a) using
Summit’s increasingly diverse and sizable datasets to answer their own research questions,
and through engaging in these analysis activities, (b) developing a generalizable set of tools and
routines for engaging in collaborative data-intensive research that other partnerships could
use. To accomplish these research goals, we (i.e., Krumm and colleagues) used a design-based
research approach (e.g., Cobb, Confrey, diSessa, Lehrer, & Schauble, 2003). A central feature
of design-research is that it represents a mode of inquiry that seeks to build theory through
directly intervening on learning environments (e.g., Barab & Squire, 2004; Bell, 2004). To
inform our design, development, and intervention activities, we used theory and prior research
from data-driven decision-making (e.g., Boudett, City, & Murnane, 2013), research-practice
partnerships (e.g., Coburn, Penuel, & Geil, 2013), and learning analytics as well as educational
data mining (e.g., Baker & Siemens, 2014).

Students in Summit consistently interact with digital learning environments in all grades and
subject areas. This level of interaction results in a large volume of structured data on both what
students are doing as they engage in learning tasks and how well they perform on those tasks.
Summit believes that every student is capable of being college and career ready and that
personalized learning opportunities can help students build necessary knowledge, habits, and skills. To
accomplish this, Summit developed a whole-school approach where students engage in (a)
project-based learning, (b) personalized learning time, and (c) one-on-one mentoring with teachers.
Across these three learning opportunities, which span all grades and subjects, students interact
with a common learning management system called the Summit Learning Platform (SLP). The
platform houses teacher-curated digital learning resources, such as online videos, and two types
of assessments based on “playlists.” One of the two types of assessments, referred to as a
diagnostic assessment, is used to identify gaps in students’ knowledge, and a summative assessment,
referred to as a content assessment, pulls randomly selected items from a large item-bank and is
used to identify whether students have achieved mastery for a focal content area. Students can
access resources and take assessments at their own discretion and as many times as necessary.
Students spend 30% of their instructional time engaged in self-directed learning as they work
to complete multiple summative assessments for a course and spend the remaining percentage
of instructional time (i.e., 70%) engaged in project-based learning, which is an approach to
instruction where students gain knowledge, skills, and productive dispositions by developing
authentic products that are organized around broad and motivating driving questions (Larmer,
Mergendoller, & Boss, 2015).

A challenge for any partnership is developing a focus for the partnership’s work (Penuel &
Gallagher, 2017). To set the research direction for the partnership, we engaged in a multi-meeting,
iterative process of having practitioners from Summit brainstorm topics and questions
and having researchers reflect back and react to each question. Based on this process, the first
data-intensive analyses that the partnership engaged in addressed the ways in which students
attempted and completed content assessments. As noted previously, students have discretion
in terms of when they attempt content assessments and how they prepare for them. Using data
from students’ interaction with the Summit Learning Platform, we initially examined relationships
among students’ standardized test performances on the NWEA MAP, their use of
teacher-curated resources, and their content assessment taking in relation to course grades. Based
on these analyses and within math courses, we observed that students who had lower incoming
MAP math test scores tended to attempt content assessments more frequently. Beyond this,
we also observed that students with higher MAP math scores used the Summit Learning Platform
in different ways than their peers with lower incoming scores. For example, students with
higher incoming test scores, on average, used more unique learning resources. On
one math playlist called “linear functions,” students in the lowest quintile on the MAP Math
test used approximately 17 unique teacher-curated resources whereas students in the highest
quintile used approximately 21. These same students also, as compared to their peers, used more
resources prior to taking their first content assessment and overall attempted content assessments
many fewer times. Overall, these early analyses hinted at the potential for using data from
the platform to inform instruction: they provided a window into the processes that students were
engaging in that held potential for explaining students’ eventual performances on individual
playlists and for the course overall.
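
The quintile comparison described above can be illustrated with a minimal sketch. It is not the
partnership's actual analysis code: the record layout, the rank-based quintile rule, the function
names, and the toy values are all assumptions made for illustration.

```python
from statistics import mean


def quintile_labels(scores):
    """Assign each score a quintile from 1 (lowest) to 5 (highest) by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    labels = [0] * len(scores)
    for rank, idx in enumerate(order):
        labels[idx] = rank * 5 // len(scores) + 1
    return labels


def mean_resources_by_quintile(students):
    """students: list of (incoming_score, set_of_resource_ids) pairs."""
    labels = quintile_labels([score for score, _ in students])
    groups = {}
    for label, (_, resources) in zip(labels, students):
        # A set holds each resource once, so its size is the unique count.
        groups.setdefault(label, []).append(len(resources))
    return {q: mean(counts) for q, counts in sorted(groups.items())}


# Hypothetical toy records: (incoming math score, resources opened on one playlist).
students = [
    (201, {"r1", "r2"}),
    (210, {"r1", "r2", "r3"}),
    (215, {"r1", "r4"}),
    (222, {"r1", "r2", "r3"}),
    (228, {"r2", "r3", "r5", "r6"}),
    (231, {"r1", "r2", "r3", "r4"}),
    (236, {"r1", "r3", "r5", "r6"}),
    (240, {"r1", "r2", "r4", "r5", "r6"}),
    (247, {"r1", "r2", "r3", "r4", "r5"}),
    (252, {"r1", "r2", "r3", "r4", "r5", "r6"}),
]
summary = mean_resources_by_quintile(students)
```

Comparing `summary[1]` with `summary[5]` gives the lowest-versus-highest quintile contrast of the
kind reported for the “linear functions” playlist.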

A key element of the overall partnership was working with practitioners at multiple levels
of Summit, from organizational leaders to teachers, to make sense of data generated by the
platform. Over time, we came to view opportunities to work directly with practitioners as learning
events (Cobb & Jackson, 2012). These learning events proved to be the primary locations
for going from a data product developed by researchers to a set of implications that would kick
off the development of concrete instructional change ideas. At a general level, learning events
involved structured activities where members of the partnership developed new understandings
by engaging in joint work. Types of learning events included simple meetings where members
of the partnership jointly interpreted data products, but learning events also included structured
co-design sessions (Penuel, Roschelle, & Shechtman, 2007), opportunities where teachers
could work first hand with data products that researchers developed, and workshops where
researchers provided explicit instruction on data analysis software. Learning events provided
opportunities for both researchers and practitioners to use their respective expertise to make
the data useful for instructional improvement.

At a multi-day learning event referred to as a data sprint, Summit staff worked directly with
data and engaged in data wrangling, exploration, and modeling tasks in collaboration with
researchers. One data product that the partnership built upon out of this event involved an
analysis that identified students who scored low on an assessment and followed it up with another
assessment, and often another low score. This cycle of repeated, negative assessment taking
was thought to stall students’ progress and lead to, in some cases, students falling further behind
their peers. Using math courses once again, the partnership operationalized these patterns as
conditional probabilities (i.e., conditional on a student not succeeding on an assessment, what is
he or she likely to do next based on prior use of the platform?), and then scaled these analyses to
include all grades and courses taught at Summit. These follow-up analyses revealed that patterns
referred to as adverse transitions were correlated with poorer performances across a range of
courses, and also that these patterns declined in frequency over time, which demonstrated that
students gradually stopped making these transitions.
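
One way to picture the conditional-probability operationalization is the sketch below. It
estimates, from a time-ordered event log, what a student does immediately after an unsuccessful
assessment. The event format, the `passing_score` threshold, and the toy log are hypothetical and
are not drawn from the Summit data; the chapter does not specify how the original analysis was coded.

```python
from collections import Counter, defaultdict


def next_action_probabilities(events, passing_score=0.8):
    """Estimate P(next action | the student just failed an assessment).

    events: time-ordered (student_id, action, score) tuples, where action is
    "assessment" or "resource" and score is None for resource views.
    passing_score is an assumed mastery threshold, not a documented value.
    """
    by_student = defaultdict(list)
    for student, action, score in events:
        by_student[student].append((action, score))

    counts = Counter()
    for history in by_student.values():
        # Walk consecutive event pairs within one student's history.
        for (action, score), (next_action, _) in zip(history, history[1:]):
            if action == "assessment" and score < passing_score:
                counts[next_action] += 1

    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()} if total else {}


# Hypothetical toy log for three students.
events = [
    ("s1", "assessment", 0.5), ("s1", "assessment", 0.6),
    ("s1", "resource", None), ("s1", "assessment", 0.9),
    ("s2", "assessment", 0.4), ("s2", "resource", None),
    ("s2", "assessment", 0.85),
    ("s3", "assessment", 0.7), ("s3", "assessment", 0.5),
    ("s3", "assessment", 0.6),
]
probs = next_action_probabilities(events)
```

Under these assumptions, the share assigned to `"assessment"` is the estimated rate of adverse
transitions (a failed attempt followed directly by another attempt), and the share assigned to
`"resource"` is the rate at which students return to learning resources first.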

Along with the types of transitions that students made following a low score on an assessment,
we also explored students’ use of learning resources across playlists. Recall that each playlist
in the Summit Learning Platform comprises both assessments and resources, and that
students are expected to use resources in order to help them pass assessments. We used an
unsupervised machine learning approach referred to as hierarchical cluster analysis, combined
with a heat map visualization, to explore patterns in students’ resource use (e.g., Bowers, 2010).
Much as with the conditional probability analyses coming out of the data sprint, an important
next step following the resource-use heat map analyses was scaling the visualization to include
all grades and subjects. To make analyses useful to practitioners and to avoid overgeneralizing
a finding, we explored within-course patterns of resource use in order to control for
variations in content and the developmental differences of students across grades. Taking an
analysis to scale meant examining whether a pattern identified in a handful of courses appeared
in other courses. The ability to run analyses on one course and then on all courses proved to be
an important value that the research team brought to the overall partnership.
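
As a rough illustration of how hierarchical clustering can order the rows of a resource-use heat
map, the sketch below implements a small agglomerative procedure in plain Python. The chapter does
not state which distance metric or linkage rule was used; Euclidean distance and average linkage
are assumptions here, as are the function names and the toy usage matrix.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length usage vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def average_linkage(rows, c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(rows[i], rows[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def cluster_order(rows):
    """Agglomerative clustering; returns row indices so similar rows are adjacent."""
    clusters = [[i] for i in range(len(rows))]
    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_linkage(rows, clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical usage matrix: one row per student, one column per resource,
# each cell the number of times the student opened that resource.
usage = [
    [5, 4, 0, 0],
    [0, 0, 3, 4],
    [4, 5, 0, 1],
    [0, 1, 4, 4],
]
order = cluster_order(usage)
heat_rows = [usage[i] for i in order]  # rows reordered for plotting as a heat map
```

Plotting `heat_rows` instead of `usage` places students with similar resource-use profiles next to
one another, which is what makes blocks of shared behavior visible in the heat map. Running the
same function once per course corresponds to the within-course approach described above.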

Our partnership with Summit highlighted the ways in which researchers and practitioners
can come together to jointly analyze and take action on data from digital learning
environments. The design-based nature of the project, which was organized around bringing
researchers closer to practice and practitioners closer to research, surfaced multiple factors associated
with data-intensive partnerships and helped in clarifying the multiple steps that can be involved
in using large, complex datasets to improve instruction.


Case 2: Measuring Productive Persistence to Help Faculty and Students 


The second of two cases described in this chapter that was central to the development of
collaborative data-intensive improvement as an approach, including its phases and conditions, was
with the Carnegie Foundation for the Advancement of Teaching (Carnegie) and the Carnegie Math
Pathways. At the start of our work together, Carnegie was well into launching and supporting the
Pathways, which is a national effort focused on improving developmental, or remedial, math courses
in two- and four-year colleges. At many colleges, these courses are significant barriers to a student’s
college completion (Bailey, Jeong, & Cho, 2010). To help more students get past the hurdle of
developmental mathematics, Carnegie brought together researchers and practitioners around
the tools and routines of improvement science (Langley, Moen, Nolan, Nolan, Norman, &
Provost, 2009). As the “hub” of a developing group of researchers and community colleges, Carnegie
formed a networked improvement community (NIC) in an effort to accelerate learning and
improvement among network members (Bryk, Gomez, Grunow, & LeMahieu, 2015). Members
of the Carnegie Math Pathways NIC designed two different course sequences geared toward
helping students fulfill their developmental math requirements as well as earn college credit in
either statistics (i.e., “Statway”) or quantitative reasoning (i.e., “Quantway”).

The success of both Statway and Quantway is well documented (e.g., Yamada, 2017; Yamada,
Bohannon, & Grunow, 2016; Yamada & Bryk, 2016). Key to the success of the Carnegie Math
Pathways NIC is a systemic approach supported by the use of improvement tools and routines.
One component of Carnegie’s systemic approach is a focus on “noncognitive” factors that affect
student success (see Yeager & Walton, 2011; Zimmerman, 2002). Many of these factors
center on students persisting through failure and using good learning strategies, which Carnegie
defines as “productive persistence.” At the beginning of our partnership, Carnegie wanted to
explore how data from various online learning systems used in the Pathways could be leveraged
in measuring and supporting students’ academic tenacity and use of effective learning strategies
(see Krumm et al., 2016).

We began working with data from Statway's online learning system at the time, which was
the Online Learning Initiative (OLI) platform. The platform collected information on each page
that a student viewed as part of the Statway curriculum and when the page was viewed, as well
as information on a variety of assessments housed within the system. Through the platform,
Statway provided students with practice assessments, embedded within the material that they
were reading, that students could use to test their own knowledge. Each item that was attempted
on an assessment, when it was attempted, and whether the item was answered correctly or not
were collected and stored by the Online Learning Initiative platform. Along with page views
and practice assessments, the platform also captured time- and item-level data from assessments
referred to as “Checkpoints,” which are quiz-like assessments that come at the end of the
“topics” and “modules” that make up the Statway curriculum.

In the fall of 2014, we started to explore the ways in which the online system was used across
individual Pathways courses. One of the benefits of looking at data stemming from the Online
Learning Initiative platform was that these data were collected unobtrusively at the scale of
the entire Pathways NIC. These data were unobtrusive in that they were gathered directly from
students as they engaged in learning activities, based on what was programmed to be captured
by the online learning system. While large volumes of data could be collected, that did not mean
that all of it would prove to be useful for understanding students' learning behaviors, strategies,
or outcomes. One step in identifying useful data involved becoming familiar with students'
experience of using the Online Learning Initiative platform, such as the ways in which
students could read pages, practice material, and take formal assessments, along with the ways
in which these data were collected and stored by the system.

One of the first exploratory analyses that we conducted focused on the dates on which students
submitted Checkpoints. We were interested in identifying how much variation there was
among students within a course in when they turned in a Checkpoint, as well as between courses
in when, on average or modally, students completed a Checkpoint. These analyses revealed that
approximately half of all Statway courses had a modal pattern that followed the intended order
of Checkpoints and that individuals as well as courses that followed the intended order tended
to perform better in terms of end-of-course grades. We followed up the course-level analyses by
further exploring students' use of the online system, focusing on the “session” as the level of
analysis. A session was defined by the online environment as the time between logging into the
system and logging out or being timed out of the system. We explored patterns in what students
did before and after a low score on a Checkpoint (i.e., a score below 60%). A key component of
productive persistence is persisting with a task after experiencing challenge or difficulty. Low
scores on Checkpoints offered a unique opportunity to measure these behaviors (Krumm et al.,
2016). We also explored the number of sessions a student logged per week, the number of days
between each session, and the types of sessions that students logged, such as assessment-only
sessions where students only worked on Checkpoints or robust sessions where students engaged
in reading, practicing, and assessment activities within the same session. All of these different
operationalizations helped in understanding the ways in which productive persistence played
out and could be measured using data from the OLI platform.
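
As an illustration of how such operationalizations might be computed, the sketch below labels
session types, counts days between sessions, and checks what a student did after a low
Checkpoint score. The record layout, the activity names, the "other" label, and the rule that
later reading or practice counts as persisting are hypothetical choices made for this example;
only the 60% threshold comes from the text.

```python
from datetime import date

LOW_SCORE = 0.60  # from the text: a Checkpoint score below 60% counts as low

# Hypothetical sessions for one student, listed in time order.
sessions = [
    {"day": date(2014, 9, 1), "activities": ["read", "practice", "assess"], "checkpoint": 0.45},
    {"day": date(2014, 9, 3), "activities": ["practice"]},
    {"day": date(2014, 9, 10), "activities": ["assess"], "checkpoint": 0.80},
]

def session_type(activities):
    """Label a session as assessment-only, robust, or other (assumed catch-all)."""
    kinds = set(activities)
    if kinds == {"assess"}:
        return "assessment_only"
    if {"read", "practice", "assess"} <= kinds:
        return "robust"
    return "other"

def days_between(sessions):
    """Number of days between each pair of consecutive sessions."""
    days = [s["day"] for s in sessions]
    return [(later - earlier).days for earlier, later in zip(days, days[1:])]

def persisted_after_low_score(sessions):
    """For each low Checkpoint score, report whether any later session
    included reading or practice (an assumed proxy for persistence)."""
    flags = []
    for i, s in enumerate(sessions):
        score = s.get("checkpoint")
        if score is not None and score < LOW_SCORE:
            later = sessions[i + 1:]
            flags.append(any({"read", "practice"} & set(t["activities"]) for t in later))
    return flags

types = [session_type(s["activities"]) for s in sessions]
gaps = days_between(sessions)
persisted = persisted_after_low_score(sessions)
print(types, gaps, persisted)
```

Keeping each measure in its own small function makes it easier to show faculty exactly how a
number was derived and to revise a definition after a design workshop.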

One way in which we sought to expand the use of these measures was to put them in front
of Statway faculty to better understand how well they captured students' learning strategies and
behaviors as well as whether they could be used to help faculty more effectively intervene with
students. In working directly with faculty, we organized design workshops that were geared toward
jointly interpreting data products that the research team provided and co-developing data products
and follow-up actions, such as change ideas that faculty could implement using a data product.
Design workshops were structured activities where researchers would present evidence on
students' use of the online learning system and instructors would co-develop additional data
products, follow-up actions, or both. Over time, the partnership viewed design workshops as the
location where data became actionable. Despite the sophistication of any analysis, no data product
proved to be actionable in and of itself; each data product required an explicit action to be developed.

One of our first workshops was organized around developing instrumental data products
related to students' productive persistence within the Online Learning Initiative platform.
Outcomes from this first workshop included finding new ways to operationalize students'
engagement with online learning materials over time and creating data products that captured what
students did alongside how well they did. For a second design workshop, evidence for the
importance of attempting and succeeding at Checkpoints had been building across multiple
analyses, and the data products and change ideas that were developed during this workshop
led to a focused improvement project related to students completing Checkpoints. An initial
improvement sprint following the workshop led to demonstrable increases in students completing
end-of-module Checkpoints (Meyer, Krumm, & Grunow, 2017).

Across multiple iterations, the design workshops themselves as well as the data products and
change ideas that were produced to support them proved to be valuable for both researchers
and practitioners. For researchers, they offered venues for learning from faculty about what they
found meaningful and whether certain patterns that were identified had face validity. For
practitioners, they offered an efficient touch-point for engaging in data-intensive research activities.
While they offered efficiencies for practitioners, they required significant pre-work on the part
of the research team both in terms of data analysis and in organizing the workshops themselves.
Following up with practitioners after a workshop was key to the overall success of the workshop.
Overall, these workshops were a potent strategy for translating findings from data-intensive
analyses into changes in instructional practices.


Case 3: Learning Analytics for Supporting Novice Programmers 
The Context of Introductory Programming in K-12 Classrooms 


Policy and educational leaders see computer science (CS) and computational thinking (CT)
skills (Grover & Pea, 2013, 2018; Wing, 2006) as necessary for all citizens, not only computer
scientists, with a view to building a strong STEM pipeline. Such problem-solving skills are seen
as necessary to succeed and innovate in a world infused with, and lives shaped by, computing
and digital devices.

Most K-12 computer science courses teach programming to support learning of computational
thinking practices such as logical and algorithmic thinking, decomposing problems, debugging, and
use of computational thinking concepts to create solutions that can be executed by a computer.
However, programming has historically been difficult for novices to learn (e.g., Pea & Kurland, 1984;
Soloway & Spohrer, 1989). This is because programming is a complex activity that involves
understanding a problem as a computational task, mapping a design for the program, drawing on
problems previously programmed that have a similar structure, instantiating abstract program patterns,
coding the program, and then testing and debugging (Pea & Kurland, 1984). It involves not only
issues of syntax of the programming environment but also the semantics of putting together
computational solutions as well as strategies and pragmatics such as testing and debugging the code.

These problems persist for novices despite the emergence of block-based programming
environments that provide a visual programming interface that makes it easy for novices to get
started with creating programs and animations without worrying about issues of programming
syntax. However, these environments do not currently aid in formative assessment of the use
of computational thinking practices and disciplinary concepts of computing to aid the learning
process in the context of programming. Examining the programming process using learning
analytics (LA) gives a more complete picture (Baker & Siemens, 2014). Being able to support
and scaffold this process requires us to have the ability to detect and recognize actions (single
or sequences of multiple actions taken together) as evidence in support for or against the use
of computational thinking. Thus, students' actions need to be interpreted as they work so that
formative feedback can be provided to steer learning.

Recent learning analytics work in the context of programming has included analyzing
students' steps to a solution using data from digital environments such as number of actions in
students' programs and number of successful and unsuccessful program compilations (Blikstein
et al., 2014). The use of clustering techniques (Bouchet, Harley, Trevors, & Azevedo, 2013) has
led to identifying various programmer behavior profiles, and unsupervised methods have been
used to derive program-state patterns and state transitions to predict success outcomes
(Berland, Martin, Benton, Petrick Smith, & Davis, 2013). Most of these techniques have involved
looking for patterns in data largely from the “bottom-up” (Winne & Baker, 2013).

New hybrid or blended LA approaches have begun to assess students' learning processes in digital
learning environments for science and math that combine top-down and bottom-up approaches to
better understand students' knowledge and skills. Examples include Gobert, Sao Pedro,
Raziuddin, and Baker (2013); Shute and Ventura (2013); and Zapata-Rivera, Liu, et al. (2016).
Many of these do so by combining bottom-up LA with Evidence Centered Design (ECD; Mislevy,
Almond, & Lukas, 2003), a principled approach to guide assessment design for top-down,
hypothesis-driven generation of a priori patterns about learner actions. Evidence Centered
Design focuses on three related models: student (what are the targeted cognitive constructs?), task
(what activities allow students to demonstrate cognitive constructs?), and evidence (what data
provide evidence of cognitive constructs?). It helps connect important constructs that we want
to measure with observable behaviors (including patterns of learner actions). Also, importantly,
evidence is obtained by deliberately putting students in situations or tasks that will elicit the
needed evidence. Once semantically meaningful patterns are defined a priori, data mining and
learning analytics techniques can be used to analyze the patterns further.

We present a theoretical framework that researchers can use to design measurement systems
for programming environments for research or application. We are using this framework
currently as part of a broader effort to study and detect patterns of learner behavior during
programming, as a first step toward being able to provide feedback to the learner and instructor
about student learning in real time.


Exploratory Work as a Backdrop to the Evolution of a Framework


We analyzed a dataset from an assessment task designed and used in prior research (Werner,
Denner, Campe, & Kawamoto, 2012). A total of 118 females and 202 males aged 10 to 14 years
completed the 30-minute task, which involved modifying existing code in the Alice programming
environment (Dann, Cooper, & Pausch, 2009). Students' computer programs and Alice data logs
were collected, and the programs were scored manually using a rubric for computational thinking
including algorithmic thinking and abstraction. We applied Evidence Centered Design
to “reverse engineer” this task into specific computational thinking concepts and skills and
give evidence of what those might look like in log files. We also compared action sequences
between students who scored high and low (relative to the median) to determine commonality
of sequences for each group. We found sequences that were significantly more common among
students with high grades and one sequence that occurred significantly more frequently for
students with low grades. Our analysis showed positive correlations among higher grades, number
of code edit actions, and number of testing events.
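
The comparison of action sequences between score groups can be sketched as follows. This is a
minimal illustration, not the procedure used in the study: the action names, the scores, the
choice of two-action sequences, and the function names are hypothetical, and the significance
testing mentioned above is left out.

```python
from collections import Counter
from statistics import median

# Hypothetical logged actions per student (not Alice's real log vocabulary).
logs = {
    "s1": ["edit", "test", "edit", "test"],
    "s2": ["edit", "test", "run"],
    "s3": ["drag", "delete", "drag"],
    "s4": ["drag", "delete", "edit"],
}
scores = {"s1": 9, "s2": 8, "s3": 3, "s4": 4}  # hypothetical rubric scores

def ngrams(actions, n):
    """Set of distinct length-n action sequences in one student's log."""
    return {tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)}

def sequence_rates(logs, scores, n=2):
    """Share of students in each score group (split at the median score)
    whose log contains each sequence at least once."""
    cut = median(scores.values())
    groups = {
        "high": [s for s in logs if scores[s] > cut],
        "low": [s for s in logs if scores[s] <= cut],
    }
    rates = {}
    for name, members in groups.items():
        counts = Counter(g for s in members for g in ngrams(logs[s], n))
        rates[name] = {g: c / len(members) for g, c in counts.items()}
    return rates

rates = sequence_rates(logs, scores)
for group, table in rates.items():
    print(group, sorted(table.items(), key=lambda kv: -kv[1])[:2])
```

Counting each sequence once per student keeps a single very active student from dominating a
group's rates. Sequences whose rates differ sharply between groups are candidates for closer
inspection.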

Through our exploratory work, we gained insights into interpreting student actions from
logs. However, we also discovered that tasks need to be complex enough to yield rich process
data logs as students apply more strategic computational thinking skills, for better coverage of
focal constructs. Measuring learning through automated means requires evidence of appropriate
as well as repeated use of constructs (Koedinger, Corbett, & Perfetti, 2012). Lastly, it became
apparent that without additional measures for ground-truthing or mapping the sequences back
to specific instances in students' programming progressions that we have evidence for, one
cannot validly interpret such sequences.


A Framework for Blending Hypothesis- and Data-Driven Learning Analytics


Building on the learning from our exploratory work, we designed a framework, or process,
that employs Evidence Centered Design in its typical forward-design application, beginning
from important focal knowledge and skills, and proceeding to task implementation. This
approach yields an overall methodology for combining top-down Evidence Centered Design-like
approaches to assessment development and delivery with bottom-up, data-driven LA
approaches.

The framework (Figure 9.1) describes an iterative process that begins with identifying
important computational thinking concepts and practices that we would like to measure. Careful
design of tasks puts students in situations that evoke behaviors to provide potential observables
of these concepts and practices. Detailed analysis of program code from different solutions
reveals students' use of constructs (correct or otherwise) and varied approaches to solutions.
Similarly, analyzing data from screen recording and/or in-person “over-the-shoulder”
observations reveals aspects of students' actions that are never seen in the final program. These can
reveal student misunderstanding of concepts even if the final solution seemingly demonstrates
appropriate usage of constructs. Combined qualitative analyses of the program solutions along
with data-driven examination of the programming process of a designed task together provide a
deeper understanding of students' actions than is possible from data-driven analytics alone,
including potential code sequences that map to practices that are identified through data logs.
These a priori patterns lay the foundation for detectors for these patterns and provide a richer
interpretation of student process in programming environments.


Applying the Framework


Guided by the framework, we applied Evidence Centered Design for the design of programming
tasks to generate richer process data to observe repeated use of constructs and computational
thinking practices. Two such tasks were piloted in two high school introductory computer science
classrooms with 27 and 28 students. Data included final Alice files and log data for all
students and screen recordings for six students. Analysis of logs revealed similar issues that
students struggled with in both tasks, for example, hard-wired vs. general solutions, improper
termination conditions, decisions pertaining to parallel vs. sequential execution, effective
solution decomposition, and (in-)appropriate random number use.

Figure 9.1 A framework for hypothesis-driven analyses to support data-driven analytics.
(Grover et al., 2017)

Analyses of screen captures from the six students using a “process over product” lens to
assess computational thinking practices suggested that some students demonstrated abstraction,
modularization, and testing in parts while others did not. Such observations will serve as
useful patterns to search for in students' log data as evidence for computational thinking skills.
In addition, we noticed certain phases during students' programming process when a student
was unable to progress. Such situations can easily lead to frustration and loss of engagement and
can thus serve as good candidates for potential patterns to be detected as students work on their
assessment tasks. Detection of such flailing behaviors in real time can help a teacher identify
when to help a student.

Task piloting and analysis led to more refined tasks that were then used in three high school
classrooms in the Western US with a total of close to 100 students. Data collection also included
screen recordings and interviews with three students in each classroom. The screen recordings
are being used to validate program snapshots created from log data, in addition to aiding
student recall of process during one-on-one interviews with each student. These interviews are also
being used to ascertain the nature and timing of help that students may have liked to support
their work. This will help us understand the nature of formative feedback and supports that can
scaffold learning for students during programming.


Case 4: Learning Analytics Enabled Formative Assessment and Changes in Teacher's
Instructional Practices


Homework is already required in schools and a meaningful amount of instructional time is
allocated to homework (Fairman, Porter, & Fisher, 2015; Loveless, 2014). But it is also
controversial and perceived as needing improvement (Kohn, 2006; Bennett & Kalish, 2006;
Trautwein & Koller, 2003). Online homework tools can provide immediate feedback to students and
real-time information for teachers to monitor student progress. In this section, we focus on
how teachers' homework review practices change when they have access to data on student
homework performance and the role of such formative assessment data for informing teachers'
instructional decisions and adaptations.


Formative Assessment and Data Use in School 


The concept of formative assessment has received much attention in K-12 research and
practitioner communities (Black & Wiliam, 1998a, 1998b; Boston, 2002; Heritage & Popham, 2013;
Roediger & Karpicke, 2006). Researchers and practitioners characterize formative assessment
as a process that uses student data to inform adaptive changes in instruction (Bennett, 2011;
Brookhart, 2007; Guskey, 2007; Heritage & Popham, 2013). The growing interest in formative
assessment is, in part, an outcome of the general dissatisfaction with the quality of information
obtained from summative assessments that generally do not provide sufficiently fine-grained
or timely feedback on student learning (McMillan, 2007; Wiliam, 2016). Research documents
modest to medium effect sizes of formative assessment on student learning (Black & Wiliam,
2009; Brookhart, 2007; Guskey, 2007; Hattie, 2009; Kingston & Nash, 2012; Shavelson, 2008;
Speece, Molloy, & Case, 2003; Thum, Tarasawa, Hegedus, Yun, & Bowe, 2015) for a variety of
different modes, grade levels, content areas, and cultural settings. Frequent use of formative
assessments can improve achievement, particularly when the results are used to adjust
instruction (Bergan, Sladeczek, Schwarz, & Smith, 1991; Speece, Molloy, & Case, 2003).

In recent years, administrator and teacher use of student assessment data to inform
instructional decisions has been at the forefront of instructional reforms (Means, Padilla, DeBarger, &
Bakia, 2009). Advocates of these reforms emphasize that teaching should be responsive to
student needs and that assessment data are essential to enable teachers to adjust instruction to better
support individual learners. The expectation is that teachers skilled in data use will develop
more effective classroom and instructional practices (Mandinach & Gummer, 2016).


The ASSISTments Online Homework Support Tool


ASSISTments is a web-based platform that provides support to students as they solve
mathematics problems and provides detailed student-level and class-level formative assessment
data to teachers to help inform adjustments in classroom instruction and pacing (Heffernan &
Heffernan, 2014). Prior small-scale studies showing the promise of ASSISTments have been
synthesized (Rittle-Johnson & Jordan, 2016). In a recent efficacy trial funded by the IES, SRI
recruited 46 middle schools from Maine, including 87 teachers and over 2,800 seventh-grade
students. In the study, schools were randomly assigned to the treatment or control condition for
two years. Schools assigned to the control condition continued with their homework practices
as they normally would. For the schools assigned to the treatment condition, in the first year,
teachers received professional development and practiced using ASSISTments with their
seventh-grade classes. In the second year, these teachers continued to use ASSISTments with a new
cohort of seventh-grade students, who were the student-level study population. The TerraNova
Common Core mathematics assessment was administered to students to measure end-of-year
outcomes. Using a hierarchical linear model (HLM), we analyzed student outcomes by
condition. The adjusted mean scores on the TerraNova were 8.84 points higher in the treatment
condition, and this result was statistically significant (effect size g = .18, p = .007) (Roschelle,
Feng, Murphy, & Mason, 2016). According to published technical norms (CTB/McGraw-Hill,
2012) that relate TerraNova scale scores to grade level equivalents, the degree of improvement
corresponds to what would be expected from 0.5 to one year of additional learning time.

A key component of ASSISTments is the easy-to-use online reports. The system analyzes the use
logs of all students and generates an Item Report that shows data for each student on every problem
and each math skill covered in the assignment, which questions and/or skills were particularly
challenging, and what the common wrong answers were. The data allow teachers to make
real-time, informed decisions about what to teach next, and they are ideally used to guide homework
review. Figure 9.2 shows an item report with the results of six items. Teachers can see the
percent correct per problem and use those data to identify weaknesses for the class. The common
wrong answers support cognitive diagnosis of misconceptions. Problem numbers appear across
the top row, class-level results appear in the next rows, and individual student results appear on
anonymized rows below. Each cell shows what the student entered first. The cell will be yellow
if the student had to be shown the answer.
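
The class-level rows of such a report can be approximated with a few lines of Python. The sketch
below is not ASSISTments code: the data layout, the answer key, and the function name
item_report are hypothetical, and it computes only percent correct and the most common wrong
answer from students' first entries.

```python
from collections import Counter

# Hypothetical answer key and first answers entered by anonymized students.
answer_key = {"P1": "12", "P2": "3/4"}
first_answers = {
    "A": {"P1": "12", "P2": "4/3"},
    "B": {"P1": "12", "P2": "4/3"},
    "C": {"P1": "10", "P2": "3/4"},
    "D": {"P1": "12", "P2": "1/4"},
}

def item_report(first_answers, answer_key):
    """Per problem: percent of first answers that were correct and the
    most common wrong first answer (None if nobody answered wrongly)."""
    report = {}
    for problem, correct in answer_key.items():
        responses = [a[problem] for a in first_answers.values() if problem in a]
        if not responses:
            continue  # nobody attempted this problem
        wrong = Counter(r for r in responses if r != correct)
        report[problem] = {
            "percent_correct": round(100 * sum(r == correct for r in responses) / len(responses)),
            "common_wrong": wrong.most_common(1)[0][0] if wrong else None,
        }
    return report

report = item_report(first_answers, answer_key)
for problem, row in report.items():
    print(problem, row)
```

In this made-up class, the low percent correct on P2 together with the repeated wrong answer
"4/3" would point a teacher toward a specific misconception to address during homework review.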

Research has found that teachers do not typically know how to use data to inform instruction
(Mandinach & Gummer, 2016; Means et al., 2009) or that they can make errors when trying to make
sense of score report results (Zapata-Rivera, Zwick, & Vezzu, 2016). Pape et al. (2013), however,
found that when both professional development and formative assessment technology are
provided, teachers can learn more about their students and adapt instruction, with resulting
improvements in student outcomes. In the ASSISTments model, teachers received a total of five days of
professional development. The professional development entailed discussions of the foundational
instructional and learning theories behind ASSISTments as well as practical issues associated with
the use of ASSISTments. Teachers learned how to use the system and how to interpret ASSISTments
reports. They also received advice on practical instructional strategies for responding to students
with different needs. The later sessions focused on helping teachers sharpen their ability to adjust
instruction in response to information in the reports and refine their class routines.


Impact of Using ASSISTments Reports on Teacher Practices 


During the Maine efficacy study, extensive data were collected through interviews with teachers
and school principals, teacher observations, surveys, instructional logs, and system use data
generated within ASSISTments. All the data focus on the implementation process
leading to outcomes. We analyzed and triangulated data from different sources to see whether
teachers used data from ASSISTments reports as a part of the implementation of the intervention
and whether teachers' instructional practices changed given the availability of the formative
assessment data.

Based on the teacher use log data, we measured the proportion of ASSISTments reports
that a teacher opened at least once. Opening a report is an important indicator of whether
the teacher is using ASSISTments to review student work and is a precursor to using
ASSISTments to adapt instruction. Across classrooms, the median for report-opening was 64%, which
is above the expected opening rate (50%).
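
A report-opening measure of this kind reduces to a proportion per teacher and a median across
them. The short sketch below uses invented log records and field names, and it aggregates by
teacher rather than by classroom, which is an assumption made to keep the example small.

```python
from statistics import median

# Hypothetical use logs: IDs of reports generated for each teacher and
# IDs of reports that the teacher opened at least once.
teacher_logs = {
    "T1": {"generated": {"r1", "r2", "r3", "r4"}, "opened": {"r1", "r2", "r3"}},
    "T2": {"generated": {"r5", "r6"}, "opened": {"r5"}},
    "T3": {"generated": {"r7", "r8", "r9", "r10"}, "opened": {"r7"}},
}

def opening_rate(log):
    """Proportion of generated reports that were opened at least once."""
    return len(log["opened"] & log["generated"]) / len(log["generated"])

rates = {teacher: opening_rate(log) for teacher, log in teacher_logs.items()}
median_rate = median(rates.values())
print(rates, median_rate)
```

Using sets means that a report opened many times still counts once, which matches the "opened
at least once" definition.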

In the instructional logs and surveys, teachers were asked whether they reviewed all
homework problems, asked students which problems to review, or reviewed selected problems
based on students' performance (a.k.a. data-driven targeted review). We found that the
intervention had statistically significant effects on homework review practices. When the
continuous variable (based on teacher logs) was used as the outcome measure, the effect size was
1.23 (p < 0.005), and for the dichotomous variable (based on the teacher survey), the odds ratio
was 45.8 (p < 0.001). Among all 38 treatment teachers who responded to the survey, 37 reported
doing targeted homework review. In the control condition, by contrast, 12 of the 36 teachers
reported that they did not do targeted homework review.

Analysis of interview transcripts and classroom observation data also provided convergent
evidence of shifts in teaching practices. These shifts centered around three areas:


1. Targeted in-class review of homework problems and concepts based on needs of
students. Compared to those in the control condition, treatment teachers were more likely
to focus their homework review (p < .01) to cover fewer homework problems
but in more depth. Teachers stated that the item report provided a starting point for
their instructional planning: they reviewed the item report to quickly identify problems
where a majority of students struggled and the common wrong answers, and purposefully
selected which concepts they needed to review during the class. In contrast, control teachers
relied on students' willingness to ask for help on certain problems, random or sequential
selection of homework problems for review, recitation of correct answers to all homework
problems without demonstrating or discussing solution procedures, or projection of
answers for students to self-correct.

2. Use of data from homework to initiate and motivate homework discussion. Presenting
reports engaged students directly with data on the homework results and reduced
students' reluctance to ask for help on problems, as they could see that other students
struggled with some of the same problems. This helped to create a safe classroom
environment where students were more willing to speak up and engage in the discussion of
homework.

3. Use of homework data to inform instructional decisions during subsequent lessons.
Treatment teachers acknowledged that they used the data from ASSISTments to inform
instructional decisions broadly. These decisions included instructional pacing, what
concepts they needed to address during subsequent lessons, and/or which students to
provide with more instructional support. Treatment teachers viewed the ASSISTments reports as a
valuable resource for understanding how students performed on the homework generally
but more specifically how well students understood or struggled with certain concepts
and procedures.


Conclusion and Discussion 


In this chapter, we presented four cases that demonstrate how learning analytics can be used
to improve learning environments across different grade levels and subjects. While data from
digital learning environments, administrative data systems, as well as sensors and recording
devices can be used to support instructional improvement, it is important to recognize that
these improvements are as much about the supporting work of researchers and practitioners as
they are about the data themselves; data are not a self-activating resource, as they require teams
of individuals to interpret them, derive implications, and develop change ideas. Across the four cases
described in this chapter, researchers and practitioners working in collaboration as well as the
use of approaches such as Evidence Centered Design can provide structures and activities for
translating data into an instructional change.

Key to the types of data addressed across the four cases is that they originated from processes
that preceded a valued outcome, such as an end-of-course grade or standardized test
performance. While these data can be collected from activities over time, they still need to be
reliable, valid measures of those processes. One challenge to creating valid measures is that the
technology from which the data are being collected may not collect all of the relevant data
(Krumm, Means, & Bienkowski, 2018). A great deal of work and energy can go into analyzing these
data, all the while critical instructional activities are occurring outside of the learning
system. Working directly with practitioners can help in better understanding the instructional
context in which technologies are used as well as in making more informed interpretations of the
events that are captured by a technology. Moreover, approaches such as Evidence Centered Design
provide a framework for interpreting available data and for developing an evidence-based
argument around what processes are being measured.

Assessment in online learning has been studied for a number of years, but only recently have
researchers begun promoting and advocating the use of learning analytics for assessing
academic progress, predicting future performance, and spotting potential problematic issues
(Johnson, Smith, Willis, Levine, & Haywood, 2011, p. 28). The Gordon Commission (2013)
recommends “separate responsibility for the use of data drawn from rich descriptions of these
transactions for administrative and for student development purposes. Teachers would be
enabled to interpret these data diagnostically and prescriptively” (p. 15). When using
learning analytics for assessment, researchers are urged to differentiate assessment of learning
(e.g. summative assessment) from assessment for learning (e.g. formative assessment, diagnostic
assessment). When the purpose of the assessment differs, the design of the learning task’s focal
knowledge, its features and timing (e.g. when learning is still happening vs. when learning has
completed), its alignment with learning standards, potential observations, and inferences from
the tasks should be adjusted accordingly. Learning analytics can be a powerful tool for
formative assessment, and for instructors to take corrective measures and monitor progress. Data
collected through learning environments tend to be rich, multi-dimensional, longitudinal,
embedded, and, importantly, inexpensive. Such data can provide opportunities for assessing
learners at a much finer-grained scale than a traditional exam: we can not only score an answer
entered by a learner right or wrong, but also look at characteristics of how learners answer
the question, such as how long it took them to answer, or whether their mouse hovered over
a wrong answer for a while, to gauge the level of performance and confidence. On the other
hand, such data can also be noisy as compared to data collected from more controlled testing
environments. While there are promising applications as shown in case 3, strong evidence is
warranted with regard to the reliability and validity of the measures produced by learning
analytics (Tannenbaum, this volume).
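
As a rough illustration of the finer-grained evidence described above, the following Python
sketch derives a response time and the time spent hovering over incorrect options from a
click-stream log for a single multiple-choice item. The event names, the field layout, and the
sample values are hypothetical and are not drawn from any of the systems discussed in this
chapter.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float      # seconds since the item was displayed (hypothetical field)
    kind: str     # "hover_start", "hover_end", or "submit" (hypothetical names)
    option: str   # answer option the event refers to


def response_features(events, correct_option):
    """Summarise one learner's interaction with one multiple-choice item."""
    hover_started = {}   # option -> time at which the current hover began
    wrong_hover = 0.0    # total seconds spent hovering over incorrect options
    answer, latency = None, None
    for ev in sorted(events, key=lambda e: e.t):
        if ev.kind == "hover_start":
            hover_started[ev.option] = ev.t
        elif ev.kind == "hover_end" and ev.option in hover_started:
            duration = ev.t - hover_started.pop(ev.option)
            if ev.option != correct_option:
                wrong_hover += duration
        elif ev.kind == "submit" and answer is None:
            # Only the first submission is scored in this sketch.
            answer, latency = ev.option, ev.t
    return {
        "correct": answer == correct_option,
        "seconds_to_answer": latency,
        "seconds_hovering_wrong": wrong_hover,
    }


# Made-up log: the learner lingers on option B before choosing C.
log = [
    Event(2.0, "hover_start", "B"),
    Event(6.5, "hover_end", "B"),
    Event(7.0, "hover_start", "C"),
    Event(8.0, "hover_end", "C"),
    Event(9.0, "submit", "C"),
]
features = response_features(log, correct_option="C")
print(features)
```

Features of this kind are only candidate evidence; as noted above, their reliability and
validity would still need to be established before they are used to support inferences about
learners.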
Note


1 We noticed that the odds-ratio for the dichotomous mediator is very big. This was possibly due to the lack of vari-


Evaluating Students’ Interpretation of
Feedback in Interactive Dashboards


Linda Corrin 


Dashboards have long been used in business and engineering fields to provide users with a
consolidated view of data to inform decision making. These decision makers are most often experts
in their profession (for example, sales managers in business or pilots in engineering), who bring
their expertise into the process of interpreting the data provided through the dashboard view.
Dashboards are designed to use data to communicate information about areas that may need
attention and action (Few, 2013). The rise of “big data” across many industries has prompted
new and innovative approaches to bringing together and displaying this data in ways that are
meaningful and informative. With increasing amounts of data being collected about students’
behaviour in learning environments, it is therefore not surprising that the idea of building
dashboards to provide an overview of student progress and performance has also become
popular in education, sparking a range of dashboard development for students across all stages of
education.

In the educational context, learning dashboards have been defined as: “a single display that
aggregates different indicators about learner(s), learning process(es) and/or learning
context(s) into one or multiple visualisations” (Schwendimann et al., 2017). While the majority of
dashboards developed in education initially focused on providing information to teachers and
administrators, an increasing number of student-facing dashboards are starting to emerge. For
students, dashboards provide an opportunity to gain feedback on their learning activities and
assessments, providing evidence to inform decisions around how they approach their study.
Many universities, schools, learning management system vendors, and other educational
technology companies are currently exploring innovative ways to deliver interactive dashboards to
students which incorporate useful information displayed in ways that are easily interpretable
by students.

However, there is an emerging concern about students’ ability to interpret the data provided
in dashboards in a way that is beneficial to their learning (Clow, 2013; Corrin & de Barba, 2014;
Teasley, 2017). Research into student dashboards, to date, has tended to focus on measuring
an increase in grade or a decrease in attrition in cohorts of students who have had access to a
dashboard (Arnold & Pistilli, 2012). Other studies have sought students’ opinions about what
they would like to see in a dashboard prior to design and development (Roberts, Howell,
Seaman, & Gibson, 2016), or evaluated student satisfaction with dashboards once they have been
implemented (Govaerts, Verbert, Duval, & Pardo, 2012). However, fewer studies have
examined students’ interpretation of dashboards and the actions they take as a result of exposure to
this feedback in detail. Understanding the ways that students interact with and interpret data
provided by interactive dashboards is vital in order to design effective dashboards that can
support student learning. Consequently, the development of more sophisticated ways of evaluating
students’ interpretation of feedback delivered via dashboards is required. In establishing ways
to improve evaluation it is wise to draw on tested and established practices from fields such as
score reporting to inform the ways such evaluation is undertaken.

This chapter explores the role of interactive dashboards in educational environments and
ways in which students’ interpretation of feedback delivered through dashboards can be
evaluated. This investigation is guided by the following questions:


1. What are the key design considerations and evaluation approaches used when developing
learning analytics dashboards to provide feedback to students?

2. What lessons can be learnt from the score reporting literature that can guide the design
and evaluation of learning analytics dashboards for students?


The chapter will also include two case studies of student-facing dashboards and the ways that
students’ interpretation of these dashboards has been evaluated. The chapter concludes with
a discussion of the importance of considered design of dashboards that link representations of
feedback to educational theories and the design of learning and assessment activities. The ways
that support for student interpretation of dashboards can be delivered will also be discussed,
including how approaches and principles from the score reporting literature can be used to
guide the development of such support mechanisms.


Background 


A decade ago the field of learning analytics emerged as the use of technology in education
became more widespread and researchers began to recognise the value in the data
automatically generated and collected by such technologies. The Society for Learning Analytics Research
(SoLAR) was subsequently established in 2012 and defines learning analytics as: “the
measurement, collection, analysis and reporting of data about learners and their contexts, for purposes
of understanding and optimising learning and the environments in which it occurs”. Research
and development in the field has grown exponentially over the past few years to encompass a
wide range of contexts, tools, frameworks, and issues. A key strength of the learning analytics
community is that it brings together researchers and developers from across multiple
disciplines including education, learning sciences, computer science, and psychology. This wealth
of perspectives and knowledge offers great potential for the development of powerful tools and
approaches to support and enhance student learning.

Amongst the wide range of learning analytics tools and techniques that have emerged, the
idea of creating dashboards of data has featured prominently. This idea has appealed strongly,
not only to learning analytics researchers, but also to educational institutions and educational
technology vendors. The utilisation of dashboards is seen as a way to harness the huge amounts
of data available from learning technologies and make this data accessible to those who can
make best use of it. The majority of learning analytics dashboards currently in use
primarily focus on providing data to teachers and educational administrators. A recent study of 55
learning dashboard projects found that 75% of dashboards were aimed at teachers, while 51%
were aimed at students (25% provided data for both students and teachers) (Schwendimann
et al., 2017). A predominant focus of these systems has been to identify individuals or groups of
students who are “at risk” of either low performance or failure. Many of the student-facing
dashboards available focus on providing students ways of seeing whether they are at risk in relation
to a single task or across a course of study.

One of the earliest and most well-known examples of this form of retention-focused student
dashboard is the Course Signals system, which was implemented in 2007 at Purdue University
(Arnold & Pistilli, 2012). This system’s dashboard used a traffic light visualisation scheme to
indicate a level of risk for students at different points throughout the semester (red = high risk,
yellow = moderate risk, green = low risk). The colours of the traffic lights were determined by a
predictive algorithm which incorporated data on student marks, interaction with the learning
management system, prior academic history, and student demographics (e.g., age, residency,
enrolled credits). In addition to the dashboard, the Course Signals system was designed to
allow teachers to implement appropriate intervention strategies for students such as sending
emails/text messages or scheduling face-to-face meetings to discuss the risk to a student’s
performance. Students also had the ability to click on their traffic light colour and receive a list
of resources that could help them in their course. Early evaluation of the Course Signals system
through surveys and focus groups found that most students (88%) reported a positive
experience of interacting with the system. The evaluation of the dashboard focused on measuring
changes in performance, retention and students’ self-reports of motivation taken at the end of
the semester (Pistilli, Arnold, & Bethune, 2012). However, subsequent analysis of the data has
raised concerns about some of the findings of this evaluation due to a reverse-causality effect
(Caulfield, 2013). This example highlights the importance of careful evaluation design in
measuring impacts of learning analytics-based systems, including dashboards.
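
The traffic light scheme described above can be sketched in a few lines of Python. The sketch
assumes that a predictive model has already produced a risk estimate between 0 and 1. That
score, the function name, and both cutoff values are hypothetical choices made for
illustration only, since the details of the Course Signals algorithm are not given here.

```python
def traffic_light(risk_score, moderate_cutoff=0.33, high_cutoff=0.66):
    """Map a predicted risk in [0, 1] to a traffic light colour.

    The cutoffs are illustrative assumptions, not values used by Course Signals.
    """
    if not 0.0 <= risk_score <= 1.0:
        raise ValueError("risk_score must be between 0 and 1")
    if risk_score >= high_cutoff:
        return "red"      # high risk
    if risk_score >= moderate_cutoff:
        return "yellow"   # moderate risk
    return "green"        # low risk


for score in (0.10, 0.50, 0.90):
    print(score, traffic_light(score))
```

Where the cutoffs are placed determines how many students see each colour, which is one
reason the evaluation of such a display needs to be designed as carefully as the display itself.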

Over time, many different forms of student-facing dashboard have been developed to
address a range of different purposes. From helping students to monitor their activity and
performance to providing evidence to promote self-reflection, dashboards have been built around
a desire to allow students to view their own data in order to promote sense-making. Recently a
number of systematic reviews of learning analytics dashboard design have been conducted on
both teacher- and student-facing dashboards (Verbert et al., 2014; Yoo, Lee, Jo, & Park, 2015;
Bodily & Verbert, 2017; Jivet, Scheffel, Drachsler, & Specht, 2017; Schwendimann et al., 2017).
These reviews have explored the purpose, design, and evaluation of dashboards in order to
provide guidance on how dashboards can be designed effectively and used to support student
learning. The next section of this chapter will explore the outcomes of these reviews in relation
to the first research question: What are the key design considerations and evaluation approaches
used when developing learning analytics dashboards to provide feedback to students?


Designing and Evaluating Learning Dashboards


The Verbert et al. (2014) review examined 24 papers on dashboards, with 14 focused on student
dashboards. The review profiled the type of user actions that were represented in the dashboards
including artefacts produced, social interactions, time spent on tasks, resource use and
activity/assessment results. Of the 14 student-focused dashboards examined in the study, only 10
reported details about the evaluation undertaken. These evaluations focused on the perceived
usefulness, usability and effectiveness (including student satisfaction) of the dashboards. It was
observed that the results of these evaluations were mixed depending on the dashboard purpose
and the data included in the dashboard design. While some studies had reported an increase in
grades, retention and self-assessment, others showed no significant difference in these areas. It
was concluded that there was limited consensus across the studies as to the most relevant data to
be included in dashboards and that more research is needed to consider what other data about
learners and the learning process could be useful. It is also suggested that more longitudinal
approaches to evaluating the impact of dashboards on student learning are required.

The review conducted by Yoo et al. (2015) focused on educational dashboards based on data
from learning management systems. Ten dashboards were included in the review, with seven
of these having a student-facing component. The information presented in these dashboards
included login trends, performance results, content usage, message analysis, online social
networks, and at-risk student prediction. The review applied Kirkpatrick and Kirkpatrick’s (2006)
four-level evaluation model (reaction, learning, behaviour and result) to each of these 10
dashboards to assess the evaluation conducted on each. Only six of the 10 dashboard studies were
found to have addressed any of the four levels, with only one dashboard study (Upton & Kay,
2009) fully evaluating each of the four levels. Yoo et al. (2015) then went on to propose an
evaluation framework for educational dashboards which brings together Kirkpatrick’s four-level
model (Kirkpatrick & Kirkpatrick, 2006), Verbert et al.’s (2013) learning analytics process model
(impact, sense-making, reflection and awareness), and Few’s (2009) blocks of information
visualisation (see Table 10.1).
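
One way to picture how such a framework might be applied is as a simple coverage check. The
Python sketch below encodes the criteria and sub-categories listed in Table 10.1 and reports
which of the four levels a given evaluation touched. The function names and the example study
are hypothetical, and the sketch is not a procedure proposed by Yoo et al. (2015).

```python
# Criteria and sub-categories as listed in Table 10.1.
FRAMEWORK = {
    "Reaction": [
        "Goal-orientation",
        "Information usefulness",
        "Visual effectiveness",
        "Appropriation of visual representation",
        "User friendliness",
    ],
    "Learning": ["Understanding", "Reflection"],
    "Behaviour": ["Learning motivation", "Behavioural change"],
    "Result": ["Performance improvement", "Competency development"],
}


def coverage(evaluated):
    """For each level, list the sub-categories an evaluation addressed or missed."""
    report = {}
    for level, subs in FRAMEWORK.items():
        report[level] = {
            "addressed": [s for s in subs if s in evaluated],
            "missing": [s for s in subs if s not in evaluated],
        }
    return report


def levels_addressed(evaluated):
    """Return the levels for which at least one sub-category was evaluated."""
    return [lvl for lvl, r in coverage(evaluated).items() if r["addressed"]]


# Hypothetical study that collected only usability and comprehension data.
study = {"Information usefulness", "User friendliness", "Understanding"}
print(levels_addressed(study))
print(coverage(study)["Result"]["missing"])
```

A study that covers only the Reaction level in such a check has asked whether students liked
the dashboard but not whether it changed what they learned or did.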

The systematic review conducted by Schwendimann et al. (2017) incorporated studies of
55 dashboards, of which 28 were student-facing. They identified six forms of data included in
the dashboard designs (activity logs, learning artefacts, self-report data, institutional databases,
physical activity and external systems) and 200 individual indicators, which they categorised
according to whether each related to the learner, action, content, results, social aspects, or
context. Over half (58%) of the studies did not contain any evaluation, but of those studies that
did contain evaluation, 65% involved a mixed methods approach which combined quantitative and
qualitative techniques. The evaluations tended to focus on usability, usefulness and user
satisfaction, with very few exploring the impact of the dashboards on learning. The review authors
observed that most evaluation methods appeared to be “low-effort, low-detail” (p. 38) and called
for more comparative studies of dashboards, indicators, visualisations and impact.


Table 10.1 Summary of the evaluation framework for educational dashboards
(adapted from Yoo et al., 2015).


Criteria     Sub-categories

Reaction     Goal-orientation
             Information usefulness
             Visual effectiveness
             Appropriation of visual representation
             User friendliness

Learning     Understanding
             Reflection

Behaviour    Learning motivation
             Behavioural change

Result       Performance improvement
             Competency development


The review by Bodily and Verbert (2017) broadened the inclusion criteria to look not just
at student dashboards, but at any form of student-facing analytics output. The review included 94
articles covering student-facing learning analytics reporting systems designed for the purposes
of awareness/reflection, recommendation of resources, improvement of retention or
engagement, increasing social behaviour online, and recommendation of courses. Of the 94 systems
reviewed, only 29 provided any interactive elements for students and very few (12) provided
justifications for the design choices made in relation to the dashboard design. In terms of
evaluation, 10 articles included some form of usability testing, 32 sought student perceptions of
usability, 34 looked at usefulness and 35 asked students if they perceived a change in behaviour,
achievement or skills. The review concluded with a list of recommendations for
implementing reporting systems which can also be used to provide a good structure for evaluating
student-facing dashboards. These include questions on intended goals, visualisation techniques,
information selection, needs assessment, usability testing, visual design, student perceptions,
actual effects, and student use.

The most recent published review of learning analytics dashboards was conducted by Jivet
et al. (2017) and focused on how theories and models from the learning sciences have been
used to inform the design of dashboards. From an initial sample of 95 papers that reported on
student-facing dashboards, widgets or visualisations, the authors identified 26 studies
that met the further criteria of being empirical and relying on educational concepts in their
design. Across the included dashboards six educational concepts were identified: cognitivism,
constructivism, humanism, descriptive models, instructional design, and psychology. The most
common goal of student-facing dashboards was to support awareness and reflection which,
along with improving metacognitive skills, monitoring progress and supporting planning, were
classified as relating to metacognitive competence. The other three competences identified were
cognitive, behavioural and emotional. Each of these competences relates to the core theory of
self-regulated learning, and the review authors suggest that dashboards should be complemented
with tools that can help students who are struggling with their self-regulation to develop their
skills. This review also raised a concern about the common use of student comparisons in
dashboards and suggests that using goal achievement as a standard for comparison could be a more
pedagogically sound approach than creating competition among students. Unlike the previous
reviews, the Jivet et al. (2017) review did not specifically investigate evaluation of dashboards, but
did make a recommendation that evaluation should be linked to the educational concepts that
inform the dashboard design.

While each of the reviews included here had a slightly different focus or sample, a number of
consistent themes emerged. The reviews showed that there are many different ways of
designing dashboards and many different purposes for which dashboards can be used. However, the
details in the literature about theoretical foundations, design considerations and data
specifications were varied or, in some cases, missing. This makes it difficult to determine whether these
elements were considered and just not reported, or whether they were not part of the dashboard
design process. In their review, Schwendimann et al. (2017) provide a checklist of elements
that they recommend be included when reporting on dashboard projects. In practice, this list
can also be used as a checklist for the process of dashboard design. The checklist includes
having a clear definition of learning dashboards, outlining the technologies used, the educational
context, the evaluation approach (including how learning impacts were evaluated), and the
resulting learner/teacher practices (Schwendimann et al., 2017). Two additions to this list can
be made from suggestions in the Bodily and Verbert (2017) review: a needs
assessment prior to design and development, and a justification of the visual techniques chosen to
represent the data in the dashboard.

It should also be noted that two critical challenges were raised in the dashboard reviews
that need to be addressed by dashboard designers in order for dashboards to be effective in
educational environments. The first of these is the ethical and privacy considerations around
the use of student data to populate learning dashboards. Much has been written about the need
for strong ethical frameworks to guide educational institutions on the protection of students’
privacy when developing learning analytics systems (Slade & Prinsloo, 2013; Sclater, 2014). It is
also imperative for the student voice to be included in discussions around the use of their data.
Recently several studies have emerged that have involved students in discussions around their
willingness to share their data for the purposes of building learning analytics tools such as
dashboards (Brooker, Corrin, Mirriahi, & Fisher, 2017; Roberts et al., 2016). While many schools
and universities have started to implement processes to protect and make ethical use of student
data, there is still some way to go in ensuring this protection is universal.

The second challenge focuses on the emphasis some dashboard designs place on allowing
students to compare their engagement and/or performance with their peers. The inclusion of
comparative elements, such as a class average, is common in learning dashboard design, especially
in dashboard products developed by educational technology vendors. However, the literature
on social comparison theory (Festinger, 1954), motivation (Pintrich, 2004), and self-regulation
(Butler & Winne, 1995) suggests that these comparative elements can have different effects on
different students. This is an area that requires more research to determine how this can best be
approached in terms of dashboard design. This should include studies in real educational
environments over time to see not only the short-term effects on students’ engagement with a single
task or subject, but also how this impacts the ways students approach their study going forward.

Approaches to the evaluation of the impact of dashboards vary across the literature. Perhaps
the most surprising outcome from the dashboard reviews was the fact that many studies either
do not undertake evaluation or, if they do, do not report the evaluation outcomes in their work.
Of those papers that did report evaluation findings, the main areas of evaluation focused on
usability, usefulness, satisfaction, and effectiveness. While the issues of usability, usefulness, and
satisfaction are all very important in ensuring that dashboards are designed well, the issue of
effectiveness is key to determining whether dashboards are a good mechanism for delivering
feedback to students. In the Verbert et al. (2014) study, measures of effectiveness were said
to include higher levels of engagement, higher performance in assessment, increased student
retention, and improvements in self-assessment. Interestingly, no clear pattern of increase in
these measures was seen across the studies in the dashboard review. In fact, several studies that
measured changes in engagement or performance found no significant change (e.g., Morris,
Piper, Cassanego, & Winograd, 2005). So, despite the fact that students are generally happy to
receive data and feedback via dashboards, there is inconsistent evidence about the impact these
dashboards are having on student learning.


Evaluation of Students’ Interpretation of Dashboard Feedback


Evaluation of students’ interpretation of the feedback delivered through learning dashboards
remains limited in the learning analytics literature. The ability to conduct this form of
evaluation faces two main challenges. The first is to be able to gather data from students at the
moment they interact with the dashboard, to understand how they translate what they are
seeing through the data visualisations into some form of action. The second challenge is to be able
to track whether the student follows through with this action and what impact this has on their
learning. The ability to measure these two things often goes beyond what is currently captured
in online systems and requires additional data collection, such as student self-report data.

To evaluate students’ interpretation of feedback delivered through dashboards it is
important to underpin the evaluation process with a strong theoretical framework. This can help to
guide the evaluation design and identify the evidence required to determine how feedback was
interpreted by students, and also what actions resulted from this interpretation. When
evaluating dashboards, it is not always possible to measure the impact on student learning directly, as
changes in learning performance can be influenced by many other factors in the educational
environment. However, evaluation can be targeted at investigating the extent to which the
purpose of the dashboard has been achieved, for example, the impact that a dashboard has had on
students’ ability to self-regulate their learning. The following two case studies demonstrate
different approaches that have been taken to evaluating students’ interpretations of dashboard
visualisations and the impact these interpretations have had on students’ approaches to their study.


Case Study 1: Learning Analytics Visualisations for a Single Task 


The first study by Beheshitha, Hatala, Gašević, and Joksimović (2016) was designed to
investigate the effect of students’ access to learning analytics visualisations on learning activity while
controlling for the motivational construct of achievement goal orientation. Situated in an
authentic learning context in higher education, students were part of an experimental
condition where they were shown one of three dashboard-style visualisations of their activity in an
asynchronous online discussion. The three visualisations showed the student their activity in
relation to either the class average for posts, the top five contributors, or the number of key
concepts included in posts. Log data was collected on the visualisation views and posted messages
in the learning management system. This was supplemented with a self-report survey using
Elliot, Murayama, and Pekrun’s (2011) 3 × 2 achievement goal model to measure students’ goal
orientation. The analysis in this study involved discourse analysis of the discussion posts (using
Coh-Metrix) and hierarchical linear mixed models for the statistical analysis of the variables
from across the data sources.

The results of the study showed different impacts on students’ activity as a result of viewing
different visualisations. For example, students with interpersonal achievement goals (i.e., a
preference to compare their work to the standard of others) who viewed the top contributors or key
concepts visualisations subsequently went on to post more, whereas those who had viewed the
class average visualisation posted less. In relation to the content of discussion posts, students
with a self-avoidance goal orientation (i.e., a motivation to avoid doing the task worse than their
own previous work) had higher levels of narrativity, deep cohesion, syntactic simplicity, and
referential cohesion in their posts after viewing the key concepts visualisations, but not when
viewing the class average or top contributors visualisations.

This study highlights the complexity of designing learning dashboards for students who may
have different motivations and goals for their study. Methodologically, this study demonstrated
a quantitative approach to measuring changes in students’ approaches to a learning task. The
ability to control for different achievement goal orientations provided a more sophisticated view
of the impact of learning analytics visualisations on student behaviour and engagement with
a task. The authors of the study suggest that further research involving other
theoretically-informed constructs could help to build a more complete picture of the impact of dashboards and
visualisations on student approaches to learning.


Case Study 2: Learning Analytics Dashboard for Multiple Assessments and Tasks 


In contrast to the first study, the second case study employed a more qualitative approach to
exploring students’ interpretation of feedback delivered via a learning analytics dashboard (Corrin &
de Barba, 2014, 2015). The study used a semester-long, multi-phase, mixed methods approach to
explore higher education students’ interpretation of dashboard data and the actions they took
in response, with reference to their self-regulated learning. The dashboard that was used as part
of this study was built to replicate the common ways of displaying student data used by learning
management system vendors. Although the dashboard was not live, it contained real data on
participants’ activities and performance. The 24 student participants were recruited from across two
discipline areas (science and languages). At the beginning of the semester the participants were
asked to fill in a survey about their personal learning goals and motivations. This was followed
by an interview in week six of semester where students were shown their dashboard and, using a
think-aloud method, asked to explain their interpretation of the data visualisations. The students
were also asked to outline any actions they might take as a result of seeing this data. A similar
interview then took place in week 11 where students were first asked to describe how seeing the
data in the previous interview had impacted their approaches to study, before going through the
same think-aloud process after seeing their updated dashboard. At the end of the semester, once
they had received their final grade for the subject, students were asked to fill in another survey to
reflect on the impact that having this feedback had on their study throughout the semester. The
survey also included questions about the usefulness of the visualisations in the dashboard.

The outcomes of the research showed a diversity in how students interpreted the dashboard
data and their ability to determine suitable actions to take in response. While the dashboard
designs incorporated visualisations of both activities in the LMS and results from assessment
tasks (see Figure 10.1), students focused primarily on the representation of summative
assessment items. Some students were able to associate the effort they put into a learning
activity or assessment with the results shown in the dashboard, while others struggled to work
out what they needed to do in order to improve their performance. An aspect of the dashboard
design that participants found useful was the fact that all the assessment and online learning
activities were shown in one consolidated view. Effectively the dashboard acted as a map of the
activities and assessments that students needed to complete throughout the semester. Students
were able to use the dashboard layout and feedback provided to plan their study schedules and
identify tasks that they may have missed.

For each of the assessment tasks and the learning management system access statistics, a class
average was given as a standard for comparison. The ability to see their activity in relation to
the average had a range of impacts on students depending on their different motivations and
goals. Those whose performance sat below the class average tended to feel it wasn’t useful to
compare their work with others. Those substantially higher than the class average reported that
it was good to know that their work was of high standard, but that the average did not influence
their actions going forward. Those students whose performance was close to the class average
either were happy that their work was comparable with others or saw this as a motivation to try
harder. Interestingly, some of these students who were happy with their average-level
performance and didn’t see a need to change their study strategies had expressed a higher
performance goal at the beginning of the semester. What this meant was that by seeing their
average performance on the dashboard they had been distracted from their original goal. While
this is only a small study with a small sample, this particular finding indicates that more
research is required to determine the broad influence of these comparative standards on
students’ interpretation of dashboard feedback and motivation.

Both these case studies highlight that there are many individual differences in how students
approach their studies that can impact the interpretation they make of feedback delivered via
learning analytics dashboards or visualisations. Another strong theme that emerged from these
two studies was the importance of the pedagogical design of learning tasks and assessments in
how students interpret the visualised feedback. While the students may know the process they
followed to complete a learning or assessment task, this doesn’t always translate into a strong
understanding of the pedagogical intent behind the task, which is important in helping students
to identify appropriate actions to take to improve their performance. These studies also
demonstrate that in order to gain a fuller understanding of the impact of dashboards on student
interpretation, research needs to go beyond a single data source (e.g., the log data from a
learning management system) to incorporate multiple methods of data collection.

Emerging research into student dashboards has begun to uncover many issues that dashboard
designers and teachers need to take into consideration when designing and implementing
dashboards in a way that can enhance the learning experience for students. Like any new field,
there is still lots to learn and the popularity of this form of feedback provision will hopefully
inspire more research in this area. It is also important to look to other fields where the use of
dashboards and reporting of educational data are more established. One such area is that of
score reporting. Consideration of existing research on score reporting is rare in the learning
analytics literature. The next section of this chapter explores what lessons can be learnt from the
score reporting literature to guide the design and evaluation of learning analytics dashboards
for students.

Figure 10.1 Science subject dashboard from the Corrin and de Barba (2015) study.


Lessons From the Field of Score Reporting 


The field of score reporting has a long history of exploring effective ways to communicate
information about student learning and the impact of the curriculum to students, teachers,
parents, and educational administrators (Ryan, 2006). The design and evaluation of score reports
are often guided by national standards. For example, in the United States the American
Educational Research Association, American Psychological Association, and the National Council on
Measurement in Education work in partnership to produce the Standards for Educational and
Psychological Testing. These standards address issues such as the validity, reliability, and fairness
of testing and the reporting of testing scores.

Particularly relevant to the area of student dashboards is the requirement in the standards
for score reports to be accompanied by supporting documentation that helps the report audience
to interpret the contents of the report. Across the reviews of learning analytics dashboards very
little mention was made of supporting resources for interpretation of dashboard visualisations.
Some dashboards were designed with the explicit purpose to be used in conversation between
student and teachers or academic advisors (e.g., Aguilar, Lonn, & Teasley, 2014), but many were
designed to promote student self-reflection. The fact that these dashboards are commonly
delivered online and accessible at any time would suggest that support for interpretation should
be built into the dashboard itself (or available in a linked resource) rather than being reliant on
conversations with teaching staff. While some support for interpretation can be built into the
visualisation itself, information about the design of the learning and assessment activities should
also be provided. As seen in the two case studies above, an understanding of the pedagogical
context of the data presented in dashboards is vital to the process of making an interpretation.

In relation to the validity and design of score reports, several large reviews of practice have
been conducted which set forward recommendations for the design of score reports. Hattie
(2009) proposes 15 principles to maximise the ability of the reader to make appropriate
interpretations. These principles address the validity of score reports by suggesting that there
should be minimal use of numbers and an effort made not to make the interface too cluttered. It is
suggested that each report should have a theme and should be designed to answer specific
questions. Among the principles are suggestions for support materials that provide a justification
for the assessment design. Hattie also calls for evidence to demonstrate how audiences interpret
the reports, in particular, an exploration of what the audience sees and what action they will take
next. Similar themes were observed in a review by Goodman and Hambleton (2004) who also
provide more specific recommendations on the visual design elements of score reports, such
as the grouping of data in meaningful ways and the highlighting of main findings using boxes
and graphics. The value of piloting reports is also included as a recommendation of this review.
In this volume, Tannenbaum adds to the discussion around validity by setting forward strategies
for alignment between assessment and score reports, design decision-points, and steps for
report development. The useful perspective provided around setting the criteria for inclusion
of data in a report can be used to inform how criteria could be set for the inclusion of data in a
dashboard, with particular reference to whether the purpose of the dashboard is summative or
formative in nature.

Further recommendations for the delivery of online static and interactive score reports are
provided by Zenisky and Hambleton (2012b). The online format provides opportunities to allow
users to take more control over their exploration and interpretation of score reports through
the use of links and subpages to provide more detail about the data presented. The online and
interactive nature of student dashboards can also provide this opportunity, yet to date few
dashboards have been built in a way that provides this functionality to students. In a recent
study of student perspectives on technology-supported feedback, several participants expressed a
desire to be able to drill down into more detail about the concepts, competencies or skills they
need to focus on to improve their assessment performance (Corrin & de Barba, 2018). In the way
that subscores can provide a more detailed picture of areas in need of development in score
reports, the ability to classify assessment questions in terms of areas of knowledge or skill could
help students to better interpret the marks and grades they can view through dashboards so they
can determine appropriate actions to take. Yet the challenge, particularly in the more generic
dashboards being delivered through learning management systems, is to get teachers to perform
this classification and for adequate testing of subscores to be undertaken (Sinharay, 2010). It is
also necessary for the dashboard to support the functionality of classification and the
interactivity of drilling down to see this greater level of detail.

The models of evaluation used to test interpretations of score reports prior to and post release
have the potential to provide useful guidance to designers of student dashboards. A
research-based model outlined by Zapata-Rivera and VanWinkle (2010) involves the gathering of
assessment information needs, the reconciliation of these needs with the score reporting needs,
the design of a score report prototype, an internal evaluation (with experts in subject matter,
usability, measurement and accessibility), and an external evaluation (with representatives of the
intended audience). Iterations of these steps can be taken as many times as needed to develop
the most useful score report. Similarly, Zenisky and Hambleton’s (2012a) score report
development model advocates field testing with potential user groups in controlled studies as well
as the development of programmes of ongoing monitoring and maintenance. It is also important
that the evaluation extends beyond the score reports themselves to the interpretation support
materials created. For example, Zapata-Rivera, Zwick and Vezzu (2016) conducted an evaluation
of the usefulness of a tutorial designed to help teachers to understand representations of
measurement error in score reports. In addition to questions about usability, the participants
were asked to complete a comprehension questionnaire to assess their understanding of the
concepts covered. The development of similar instruments to investigate students’ interpretation
of dashboard visualisations and any supporting materials could be useful for dashboard
design and delivery.
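
The idea raised earlier in this section, of classifying assessment questions by area of knowledge
or skill so that a dashboard can offer a drill-down view of subscores, can be made concrete with
a small sketch. Everything in it is an illustrative assumption rather than part of any study cited
here: the question identifiers, the skill areas, the marks, and the 0.5 review threshold are all
invented, and the sketch does not perform the subscore testing that Sinharay (2010) calls for.

```python
from collections import defaultdict

# Hypothetical question-to-skill classification that a teacher might supply.
QUESTION_AREAS = {
    "q1": "cell biology",
    "q2": "cell biology",
    "q3": "genetics",
    "q4": "genetics",
    "q5": "ecology",
}


def subscores(marks, max_marks, areas):
    """Return the proportion of available marks earned in each skill area."""
    earned = defaultdict(float)
    available = defaultdict(float)
    for question, area in areas.items():
        # An unanswered question counts as zero marks earned.
        earned[area] += marks.get(question, 0.0)
        available[area] += max_marks[question]
    return {area: earned[area] / available[area] for area in available}


def areas_to_review(scores, threshold=0.5):
    """List skill areas whose subscore falls below an illustrative threshold."""
    return sorted(area for area, score in scores.items() if score < threshold)


# Invented marks for one student; every question is worth 4 marks here.
student_marks = {"q1": 4, "q2": 3, "q3": 1, "q4": 2, "q5": 2}
max_marks = {question: 4 for question in QUESTION_AREAS}

student_subscores = subscores(student_marks, max_marks, QUESTION_AREAS)
review = areas_to_review(student_subscores)
```

With these invented marks the student earns 7 of 8 marks in cell biology, 3 of 8 in genetics, and
2 of 4 in ecology, so only genetics falls below the threshold. A dashboard built on such a
structure would still depend on teachers supplying the classification and on each area containing
enough questions for the subscore to be meaningful.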


Conclusion 


Research into student-facing analytics feedback is still in its early days and the literature on
student dashboards is currently not mature enough to be able to provide authoritative guidance on
the most effective visual elements to assist student learning. However, major themes are
emerging around the importance of the theoretical foundation behind the purpose of dashboards and
pedagogical design of the learning activities included in the dashboard in providing feedback to
students. The research has also shown that student characteristics, such as goal orientation and
motivation, can have a considerable influence on how dashboards are interpreted by students.
Designing dashboards to address these issues presents a particular challenge to educational
technology vendors who often seek to provide a “one-size-fits-all” dashboard product to
institutions. The emerging research would indicate that this approach, as Teasley (2017, p. 6)
suggests, “may be unwise.”

So, if we return to where we started, to the history of dashboards and their role in supporting
decision-making in business and engineering, we see that the audience for this form of feedback
are experts in their field. However, while students have experience at being students, they are
not experts in education. They often lack sufficient knowledge about the pedagogical intent of
learning activities, the design of assessment, and an understanding of how this all fits within the
broader curriculum. Additionally, they may not have adequate levels of data literacy to be able
to understand the statistics and visualisations presented in dashboards and reports (MacNeill,
Campbell, & Hawksey, 2014; Vezzu, VanWinkle, & Zapata-Rivera, 2012).

These challenges highlight the importance of the provision of support for the interpretation
of feedback given through learning analytics dashboards. This can be done in a number of ways
including the incorporation of visual elements and descriptions to accompany data
representations, the provision of supporting materials, or the provision of face-to-face support
from teachers and/or academic advisors. Tactics such as the use of evidence-based or stealth
assessment designs could also help to strengthen students’ understanding of learning objectives
and expectations of performance (Shute & Kim, 2014).

Important lessons can also be learnt from the work on creating reports for “open learner
models” which looks at ways to provide information to students about the content and skills
associated with educational systems such as intelligent tutoring systems (Bull & Kay, 2016; Bull,
Wasson, Johnson, Petters, & Hansen, 2012). This work demonstrates how visualisations of data
can be built within the context of a learning activity design and adapted for students in ways
that allow them to explore different levels of granularity, effectively creating an “active report”
(Zapata-Rivera, Hansen, Shute, Underwood, & Bauer, 2007). The value in these approaches is
that visualisations are grounded within a model that clearly represents the design of the
learning activity, helping to support and raise students’ metacognitive awareness (Vatrapu,
Teplovs, Fujita, & Bull, 2011).

In designing such support and evaluating the interpretations that students make of feedback
delivered through dashboards, the field of learning analytics would be wise to look to work
already done in other disciplines. As has been outlined in this chapter, the field of score
reporting provides useful information on ways that feedback can be visualised as well as models
for the evaluation of student interpretations of this feedback. This literature and the emerging
literature on evaluation of dashboards in the field of learning analytics suggest a move beyond
simple measures of student satisfaction and usability towards methods which can capture students’
understanding of the feedback as well as the ways that they translate these understandings into
action. The proposed stages of score report development models presented here (Zapata-Rivera
& VanWinkle, 2010; Zenisky & Hambleton, 2012a) are also useful to ensure that evaluation
is built in across the whole dashboard development process. There is lots of work still to be
done in determining the most effective ways of designing student dashboards, but by drawing
on these evaluation models as well as educational theories and learning designs, institutions
and teachers will be better placed to design dashboards that provide useful feedback to support
student learning.